Video Transcript

Ganglia is a distributed and highly scalable cluster monitoring tool. It monitors memory, disk, CPU usage, network, and other aspects of the cluster’s health, and it makes this information available for offline analysis. This means that if some systems are offline for any reason, you will still be able to diagnose the issue with the data already reported to Ganglia.

Ganglia is composed of three main parts. The first one is the Ganglia monitor, gmond, which must run on every node being monitored. The second part is the Ganglia metadata monitor, gmetad, responsible for polling and collecting the remote systems’ data. It uses RRDtool (round-robin database tool) to store the data, and it is also responsible for publishing the data to the web interface. The third part is the Ganglia web front-end, ganglia-web, which usually resides on the same machine as gmetad and accesses the RRD files.

In this diagram, we can see how Ganglia operates. The nodes send their stats with gmond, then gmetad collects this data and stores it in the RRD files using RRDtool, and ganglia-web publishes it in the web front-end.
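To make the collection step a bit more concrete, here is a minimal Python sketch of what a poller like gmetad does: it connects to gmond's TCP accept channel and reads an XML dump of the current metrics. The port 8649 is the usual gmond default and is an assumption here, as are the helper names `fetch_gmond_xml` and `metrics_by_host` and the hand-written sample document, which only imitates the general shape of gmond's output.

```python
import socket
import xml.etree.ElementTree as ET

def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump gmond writes to a client that connects to its
    TCP accept channel. Port 8649 is assumed (the usual default)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:          # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)

def metrics_by_host(xml_data):
    """Return {host name: {metric name: value string}} from a gmond XML dump."""
    root = ET.fromstring(xml_data)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL")
            for metric in host.iter("METRIC")
        }
    return result

# Illustrative sample only; a real dump carries many more attributes.
SAMPLE = (
    '<GANGLIA_XML VERSION="3.7.2" SOURCE="gmond">'
    '<CLUSTER NAME="mycluster">'
    '<HOST NAME="node01" IP="172.16.1.2">'
    '<METRIC NAME="load_one" VAL="0.25" TYPE="float"/>'
    '<METRIC NAME="mem_free" VAL="1048576" TYPE="float"/>'
    '</HOST>'
    '</CLUSTER>'
    '</GANGLIA_XML>'
)

print(metrics_by_host(SAMPLE))
# Against a live gmond one would call, for example:
#   metrics_by_host(fetch_gmond_xml("172.16.1.1"))
```

The parsing is kept separate from the network call so it can be tried on a saved dump without a running cluster.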

The installation is very straightforward. On the master node, we install the packages ganglia-gmond, ganglia-gmetad, and ganglia-web. This installation places default configurations for gmetad and the web server, which have to be adjusted for our purposes:

For gmetad, edit the configuration file /etc/ganglia/gmetad.conf as follows:

Define the data_source with the name of the cluster, the polling interval (how often gmetad is going to reach out to the nodes), and the IP address of the host or hosts that gmetad polls. Any additional hosts are used in case the first one does not respond.
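As a rough illustration of how those three pieces fit on one data_source line, the Python sketch below splits such a line into its fields. The cluster name "mycluster", the port 8649, and the helper `parse_data_source` are assumptions for illustration; the 15-second fallback reflects the documented default when the interval is left out.

```python
import shlex

def parse_data_source(line, default_interval=15):
    """Split a gmetad data_source line into name, polling interval, and hosts.

    Expected shape: data_source "name" [interval] host[:port] [host[:port] ...]
    """
    tokens = shlex.split(line)   # shlex keeps the quoted cluster name together
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError('expected: data_source "name" [interval] host[:port] ...')
    name, rest = tokens[1], tokens[2:]
    # The interval is optional: a bare integer right after the name.
    if rest[0].isdigit():
        interval, hosts = int(rest[0]), rest[1:]
    else:
        interval, hosts = default_interval, rest
    if not hosts:
        raise ValueError("at least one host is required")
    return {"name": name, "interval": interval, "hosts": hosts}

# Hypothetical line: one-second polling of the master at 172.16.1.1.
print(parse_data_source('data_source "mycluster" 1 172.16.1.1:8649'))
```

Listing more than one host after the interval gives gmetad a fallback to try if the first host does not answer.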

The RRAs parameter defines the round-robin archives. In general, each one is described by RRA:CF:xff:steps:rows.

- CF is the consolidation function. Here it is AVERAGE, but it can also be MIN, MAX, or LAST.
- The xfiles factor (xff), here 0.5, is the fraction of single data points that may be missing while still producing a valid consolidated data point. In other words, if the data is consolidated every minute with a polling rate of one second, an xff of 0.5 means that at least 30 points are needed to calculate the consolidated point using the function declared in CF. With an xff of 0, no data loss is permitted, so a single missing sample invalidates the consolidated data point (unusable data). Leave it at 0.5.
- Steps set the resolution of the archive. The actual resolution is steps * [polling interval]. In this example, that is one second and 60 seconds, since our polling interval is one second.
- Rows set how long the archive can hold data. The actual time is steps * [polling interval] * rows. In this example, 24 hours and two weeks, respectively.
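The arithmetic above can be checked with a short Python sketch. The two RRA strings are assumptions: the row counts 86400 and 20160 are not taken from the configuration shown in the video, they are simply values that reproduce the stated 24 hours and two weeks with a one-second polling interval. The helper `describe_rra` is likewise hypothetical.

```python
import math

def describe_rra(rra, polling_interval_s):
    """Compute resolution, retention, and minimum known points for one RRA."""
    prefix, cf, xff, steps, rows = rra.split(":")
    if prefix != "RRA":
        raise ValueError("definition must start with 'RRA'")
    xff, steps, rows = float(xff), int(steps), int(rows)
    resolution_s = steps * polling_interval_s        # seconds covered by one row
    return {
        "cf": cf,
        "resolution_s": resolution_s,
        "retention_s": resolution_s * rows,          # total time the archive holds
        # Points that must be present for a valid consolidated value.
        "min_known_points": steps - math.floor(xff * steps),
    }

# Assumed definitions matching the example: polling interval of one second.
for rra in ("RRA:AVERAGE:0.5:1:86400", "RRA:AVERAGE:0.5:60:20160"):
    info = describe_rra(rra, polling_interval_s=1)
    print(rra, "->", info["resolution_s"], "s per row,",
          info["retention_s"] / 86400, "days retained,",
          info["min_known_points"], "points needed")
```

With these assumed values, the first archive keeps one day at one-second resolution and the second keeps 14 days at one-minute resolution, needing at least 30 of the 60 samples per row.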

setuid_username is the user that owns the process and the files. In this case it is ganglia, which is created by the installer.

For gmond, the configuration can be edited in /etc/ganglia/gmond.conf. In this case, we are doing a unicast configuration, meaning the master collects all the information from the compute nodes. The other option is multicast, meaning all the nodes collect information from the other nodes. Multicast can be advantageous: every node is aware of the other nodes’ state, so no data is lost if the gmetad host fails. For the unicast mode, configure the UDP send and receive ports and the TCP accept channel.

Next, configure the web server by editing the file /etc/httpd/conf.d/ganglia.conf. Here we add an alias for /ganglia pointing to /usr/share/ganglia. In this way, the front-end can be accessed at http://<webserver IP>/ganglia (in my case, http://172.16.1.1/ganglia). Also set the access permissions, which can be tuned. Here, everybody has access.

Finally, enable and start/restart the gmond, gmetad, and httpd services.

On the compute nodes, only ganglia-gmond is needed. The same gmond.conf file used on the master can be used on the compute nodes. Finally, enable and start/restart the gmond service.