Overview

We run Netdata in a Docker environment on large hosts, each with 64 GB of RAM and 10 CPUs. Each host runs more than 40 machines, and every machine runs services such as PostgreSQL, MongoDB, Tomcat, Apache HTTP Server, and Solr.

Each machine has its own Netdata instance, which collects detailed metrics and forwards them to a central Netdata server. We currently operate six of these large hosts across two data centers.

Observed Issue

The deployment works, but we have run into an unusual problem: every 90 minutes the load average spikes, reaching values as high as 120. That is far more than a 10-CPU system can absorb; a temporary load of around 20 would be manageable.

The high load lasts a few minutes and then falls back to its normal range of 2-4, so the systems are mostly idle the rest of the time.

When we investigated, we found no single process responsible for the high load. Instead, it appears that the Netdata Python scripts on the various machines all execute at the same moment, and their combined work produces the spike.
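One way to gather evidence for this is to record the load average together with the number of runnable tasks while a spike is in progress. The sketch below parses the Linux /proc/loadavg format; the sample line and the helper names are hypothetical and are not part of Netdata.

```python
from dataclasses import dataclass


@dataclass
class LoadSample:
    load1: float      # 1-minute load average
    load5: float      # 5-minute load average
    load15: float     # 15-minute load average
    runnable: int     # scheduling entities currently runnable
    total: int        # scheduling entities that exist on the system
    per_cpu: float    # 1-minute load divided by the CPU count


def parse_loadavg(line: str, cpus: int) -> LoadSample:
    """Parse one line in the Linux /proc/loadavg format."""
    load1, load5, load15, entities, _last_pid = line.split()
    runnable, total = entities.split("/")
    return LoadSample(float(load1), float(load5), float(load15),
                      int(runnable), int(total), float(load1) / cpus)


def read_loadavg(cpus: int, path: str = "/proc/loadavg") -> LoadSample:
    """Read the current sample from the system (Linux only)."""
    with open(path) as handle:
        return parse_loadavg(handle.readline(), cpus)


# Hypothetical sample resembling the spike described above.
spike = parse_loadavg("120.00 45.10 20.30 131/2875 40211", cpus=10)
print(f"load per CPU: {spike.per_cpu:.1f}, runnable: {spike.runnable}")
```

Logging read_loadavg(cpus=10) every few seconds around the expected spike time would show whether the runnable count jumps all at once, which is what many small processes starting together would look like.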

Steps Taken

To mitigate the issue, we have implemented several adjustments:

  • Disabled most Netdata plugins, retaining only those for monitoring CPU, network, disk, Tomcat, and Apache.
  • Configured the remaining plugins to collect every 5 seconds; collecting more often raises the load and keeps the server from returning to normal levels.
  • Disabled plugins for PostgreSQL and MongoDB due to their significant impact on performance, despite our interest in monitoring these services.
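
To illustrate why synchronized collection could matter more than the interval itself, here is a small simulation. It is only a sketch: it does not model Netdata's actual scheduler, and the 300 ms run time per collection is an assumed figure. It compares 40 collectors that all start together with 40 collectors whose start times are spread evenly over the 5-second interval.

```python
def peak_concurrency(offsets, interval, run_time, horizon):
    """Return the largest number of collectors running at the same instant.

    Each collector starts at its offset, then every `interval` ms,
    and stays busy for `run_time` ms per run.
    """
    events = []
    for offset in offsets:
        start = offset
        while start < horizon:
            events.append((start, 1))              # a run begins
            events.append((start + run_time, -1))  # the run ends
            start += interval
    # Tuples sort by time first; at equal times, ends (-1) precede starts (+1).
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


COLLECTORS = 40
INTERVAL_MS = 5000   # the 5-second update interval from the list above
RUN_MS = 300         # assumed duration of one collection run
HORIZON_MS = 60000   # simulate one minute

aligned = [0] * COLLECTORS
spread = [i * INTERVAL_MS // COLLECTORS for i in range(COLLECTORS)]

aligned_peak = peak_concurrency(aligned, INTERVAL_MS, RUN_MS, HORIZON_MS)
spread_peak = peak_concurrency(spread, INTERVAL_MS, RUN_MS, HORIZON_MS)
print(aligned_peak, spread_peak)
```

With these assumed numbers the aligned case peaks at 40 concurrent runs and the spread case at 3, even though the total amount of work is identical.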

Questions and Considerations

We seek guidance on how to further refine the Netdata configuration to prevent these periodic CPU load spikes. Given that we have 40 identical configurations running concurrently, we wonder if the Docker environment combined with Netdata's architecture is contributing to the issue.

The timing of the spikes, occurring every 90 minutes, suggests a potential pattern in how Netdata invokes its plugins. We are uncertain of the underlying cause and would appreciate any insights or recommendations for managing monitoring in such a complex system.
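
Before attributing the 90-minute rhythm to Netdata, it may help to confirm how regular it really is. The following sketch computes the gaps between recorded spike times and checks whether they stay close to a common period. The timestamps are hypothetical, and the 2-minute tolerance is an arbitrary assumption.

```python
from datetime import datetime, timedelta
from statistics import median


def spike_intervals(timestamps):
    """Return the gaps between consecutive spike times, in minutes."""
    ordered = sorted(timestamps)
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(ordered, ordered[1:])]


def looks_periodic(timestamps, tolerance_minutes=2.0):
    """Report the median gap and whether every gap stays close to it."""
    gaps = spike_intervals(timestamps)
    if len(gaps) < 2:
        return None, False
    typical = median(gaps)
    steady = all(abs(gap - typical) <= tolerance_minutes for gap in gaps)
    return typical, steady


# Hypothetical spike times, roughly 90 minutes apart.
start = datetime(2024, 1, 1, 0, 7)
observed = [start + timedelta(minutes=m) for m in (0, 90, 181, 270, 360)]
period, steady = looks_periodic(observed)
print(period, steady)
```

A steady period measured this way can then be compared against anything else on the system that runs on a fixed schedule, which would help separate a Netdata cause from an unrelated one.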