Video Transcript
Ganglia is a distributed and highly scalable cluster monitoring tool. It monitors memory, disk, CPU usage, network, and other aspects of the cluster’s health, and it makes this information available for offline analysis. This means that if some systems go offline for any reason, you will still be able to diagnose the issue with the data already reported to Ganglia.
Ganglia is composed of three main parts. The first is the Ganglia monitor, gmond, which must run on every node being monitored. The second is the Ganglia metadata monitor, gmetad, responsible for polling and collecting the remote systems’ data. It uses RRDtool (round-robin database tool) to store the data, and it is also responsible for publishing the data to the web interface. The third is the Ganglia web front-end, ganglia-web, which usually resides on the same machine as gmetad and accesses the RRD files.
In this diagram, we can see how Ganglia operates. The nodes send their stats with gmond; gmetad then collects this data and stores it in the RRD files using RRDtool, and ganglia-web publishes it in the web front-end.
The installation is very straightforward. On the master node, we install the packages ganglia-gmond, ganglia-gmetad, and ganglia-web. This installation places default configurations for gmetad and the webserver, which have to be adjusted for our purposes:
For gmetad, edit the configuration file /etc/ganglia/gmetad.conf in the following manner:
Define the data_source with the name of the cluster, the polling interval (how often gmetad is going to reach out to the nodes), and the IP address of the gmetad host (or several hosts, so that another one can be tried in case the first one does not respond).
The RRAs parameter defines the round-robin archives. In general, they are described by RRA:CF:xff:steps:rows.
- CF is the consolidation function, which is AVERAGE here, but it can also take the values MIN, MAX, and LAST.
- The xfiles factor (xff), which in this case is 0.5, is the fraction of single data points that may be missing while a valid consolidated data point can still be calculated. In other words, if the data is consolidated every minute with a polling rate of 1 second, a 0.5 xff means that at least 30 points are needed to calculate the consolidated point using the method declared in the CF. With an xff of 0, no data loss is permitted, so any gap leads to an invalid consolidated data point (unusable data). Leave it at 0.5.
- Steps is the resolution of the archive. The actual resolution time is steps * [polling interval]. In this example, it is one second and 60 seconds, since our polling interval is one second.
- Rows is the length of time the archive can hold. The actual time is steps * [polling interval] * rows. In this example, 24 hours and two weeks, respectively.
setuid_username is the user that owns the process and the files; in this case, ganglia (created by the installer).
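The RRA arithmetic described above can be checked with a few lines of Python. This is only a sketch: the row counts 86400 and 20160 are assumptions chosen to reproduce the 24 hours and two weeks mentioned in the video, and the helper names (parse_rra, resolution_seconds, retention_seconds, min_known_points) are made up for illustration and are not part of Ganglia or RRDtool.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RRA:
    cf: str      # consolidation function: AVERAGE, MIN, MAX or LAST
    xff: float   # fraction of a consolidation interval that may be missing
    steps: int   # primary data points per consolidated point
    rows: int    # consolidated points kept in the archive


def parse_rra(spec: str) -> RRA:
    """Parse a string of the form RRA:CF:xff:steps:rows."""
    tag, cf, xff, steps, rows = spec.split(":")
    if tag != "RRA":
        raise ValueError(f"not an RRA definition: {spec!r}")
    return RRA(cf, float(xff), int(steps), int(rows))


def resolution_seconds(rra: RRA, polling_interval: int) -> int:
    """Time covered by one consolidated point: steps * polling interval."""
    return rra.steps * polling_interval


def retention_seconds(rra: RRA, polling_interval: int) -> int:
    """Time the archive can hold: steps * polling interval * rows."""
    return rra.steps * polling_interval * rra.rows


def min_known_points(rra: RRA) -> int:
    """Fewest known points needed for a valid consolidated point."""
    return math.ceil(rra.steps * (1 - rra.xff))


POLLING_INTERVAL = 1  # seconds, as in the video

# Hypothetical archives: row counts are assumed, not taken from the video.
for spec in ("RRA:AVERAGE:0.5:1:86400", "RRA:AVERAGE:0.5:60:20160"):
    rra = parse_rra(spec)
    hours = retention_seconds(rra, POLLING_INTERVAL) / 3600
    print(spec, resolution_seconds(rra, POLLING_INTERVAL), "s resolution,",
          hours, "h retention,", min_known_points(rra), "points needed")
```

With a one-second polling interval, the second archive consolidates 60 points into one and needs at least 30 of them to be present, which matches the xff explanation above.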
For gmond, the configuration can be edited in /etc/ganglia/gmond.conf. In this case, we are doing a unicast configuration, meaning the master collects all the information from the compute nodes. The other option is multicast, meaning all the nodes collect information from the other nodes. Multicast can be advantageous since no data is lost if the gmetad host fails, and it makes all the nodes aware of the other nodes’ state. For the unicast mode, configure the UDP send and receive ports and the TCP accept channel.
Next, configure the webserver by editing the file /etc/httpd/conf.d/ganglia.conf. Here we add an alias for /ganglia pointed to /usr/share/ganglia. In this way, the front-end can be accessed from http://<webserver IP>/ganglia, in my case http://172.16.1.1/ganglia. Also, set some permissions (they can be tuned). Here, everybody has access.
Finally, enable and start/restart the gmond, gmetad, and httpd services.
On the compute nodes, only ganglia-gmond is needed. The same gmond.conf file used on the master can be used on the compute nodes. Finally, enable and start/restart the gmond service.