I have my own collection of plugins that feed interesting information to Ganglia, a popular distributed cluster monitoring application. It supplements the list at SourceForge, which has been closed for a long time; I don't really like most of the metrics there anyway. All of these metrics are aimed at Linux, though some of them will work on (or could be ported to) other operating systems. I call each of these scripts once every two minutes from /etc/cron.d/ganglia (don't forget the user field - root).
Without further ado...
ganglia-logtailer is a Python framework to gather arbitrary information from a logfile and report it into the ganglia infrastructure using gmetric. For a long time it's been easy to gather metrics from components that allow you to query them (for example, you can ask mysql for the number of questions to get query count). However, applications that don't have a network interface or other method of querying them for state have been a challenge. Nearly everything writes a log, though, so ganglia-logtailer allows you to parse the log for relevant information, aggregate it, and report it to ganglia.
We've included a number of modules with ganglia-logtailer for different logs: apache, unbound, slapd, bind, postfix, and a 'dummy' that you can use as a template for new log formats. Adapting the framework to gather information from a new application involves writing a regular expression to match the log lines you're interested in, filling in the function to collect the data from each line, and finally filling in the function that aggregates that data into values to report via gmetric. ganglia-logtailer itself is then invoked as a daemon or called from cron (how we do it) on a regular basis.
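The three steps described above (a regular expression, a per-line collection function, and an aggregation function) can be sketched roughly as follows. This is only an illustration of the pattern, not the framework's real interface: the `StatusCounter` class, its method names, the log format, and the metric names are all assumptions made up for this example. The `gmetric_command` helper just builds an argument list using gmetric's standard `--name`, `--value`, `--type`, and `--units` options and does not run anything.

```python
import re
from collections import Counter

# Hypothetical log format: "<METHOD> <path> <status>", e.g. "GET /index.html 200".
LINE_RE = re.compile(r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$")


class StatusCounter:
    """Illustrative parser; names are assumptions, not ganglia-logtailer's API."""

    def __init__(self):
        self.counts = Counter()

    def parse_line(self, line):
        """Collect data from one log line; non-matching lines are ignored."""
        match = LINE_RE.match(line.strip())
        if match:
            # Bucket by status class: 2xx, 3xx, 4xx, ...
            self.counts[match.group("status")[0] + "xx"] += 1

    def aggregate(self, interval_seconds):
        """Turn the raw counts into per-second rates keyed by metric name."""
        return {
            "http_%s_per_sec" % bucket: count / interval_seconds
            for bucket, count in self.counts.items()
        }


def gmetric_command(name, value, units="per_sec"):
    """Build the argument list for a gmetric call (not executed here)."""
    return ["gmetric", "--name", name, "--value", "%.2f" % value,
            "--type", "float", "--units", units]
```

A cron-driven run would feed every new log line to `parse_line`, call `aggregate` with the number of seconds since the previous run, and hand each resulting command to something like `subprocess.call`.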
ganglios is a method for integrating ganglia and nagios. Its goal is to provide the (current) data collected by ganglia to nagios in a format that makes it easy to query. When nagios can easily query the ganglia data, it becomes trivial to set up alerts based on data ganglia reports.
Ganglia ramdisk consists of a cron job and a start script that help you set up the gmetad rrds repository on a ramdisk instead of using physical disk. This helps as your cluster grows. Missing is grub.conf, which should be modified to specify the size of the ramdisk you wish to create. Also missing is the step of switching /var/lib/ganglia/rrds (or wherever your default storage is) to a symlink that points to the ramdisk.
2007-03-08 nfsd_gmetric.pl (local mirror)
nfsd_gmetric.pl is a script written by Vladimir Vuksan that collects information about the NFSd server, including access, getattr, setattr, remove, etc. It reads the info straight out of /proc, which makes it nice and lightweight. All his ganglia gmetric scripts can be found at his ganglia gmetric repository.
mcd_gmetric.sh collects metrics from the excellent memcached package. The metrics it collects include the total number of connections and items, the number of gets, sets, and hits per second, and the overall hit percentage. Knowing these numbers greatly helps you tune your memcached installation. Please note: this version of the script only works with memcached version 1.1.13. To make it work with a different version, you must change the index into the array holding the stats, because the package reports different stats with each version. Really, the fault lies in my inflexible code, which I will fix one of these days, but until then you must tailor this script to your installed version. (P.S. Versions prior to 1.1.13 can't take more than 3GB RAM due to a bug. That's why I use 1.1.13... ;)
Jan Miczaika has written a more stable version (that actually keys off the memcached stat name rather than its position) that is available on the HitFlip Open Source page. His doesn't report the hit percentage yet, but he is planning on adding some of my stats to his script. So our two scripts are on a convergent path towards perfection... :)
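To show why keying off the stat name is sturdier than a fixed array index, here is a small sketch that parses the text of a memcached `stats` response into a dictionary and derives the hit percentage from `get_hits` and `get_misses`. It is not either of the scripts mentioned above; the function names are made up for this example, and it assumes the response text has already been read from the server.

```python
def parse_stats(text):
    """Parse 'STAT <name> <value>' lines into a dict keyed by stat name."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2].strip()
    return stats


def hit_percentage(stats):
    """Overall hit percentage, or 0.0 if there have been no gets yet."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return 100.0 * hits / total if total else 0.0
```

Because each value is looked up by name, a version that adds or reorders stats does not break the lookup; a missing stat simply falls back to the default.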
Credit for this script goes to Nicolas Marchildon. ganglia_apache.pl interfaces with apache to report the number of queries per second, sorted by return code (200 = ok, 300 = redirect, etc.). This script is not called from cron, but instead hooks into apache via the following two lines in the httpd.conf file:
LogFormat "%>s" status_only
CustomLog "|/path/to/apache-logs-to-ganglia.pl -d 10" status_only
disk_wait_gmetric.sh collects the '%utilization' and 'await' statistics for each device. This script runs for 30s to collect the needed information. It also requires iostat from the sysstat package.
mysql_gmetric.sh collects the following metrics from mysql:
- number of threads
- number of queries per second
- number of slow queries per second (though this metric is kinda meaningless)
- seconds a slave is behind the master (if it is a slave)
network_gmetric.sh collects network usage statistics on a per-interface level. This script collects its information directly from /proc/net/dev and looks for devices that are named 'eth', so it will only work on Linux.
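For readers unfamiliar with /proc/net/dev, the sketch below shows one way to pull per-interface byte counters out of its text and turn two samples into a rate. It is not the shell script itself; the function names are made up for this example, and it assumes the usual Linux layout of two header lines followed by one `name: counters...` line per interface, with received bytes in the first column and transmitted bytes in the ninth.

```python
def parse_net_dev(text, prefix="eth"):
    """Return {interface: {'rx_bytes': int, 'tx_bytes': int}} for matching devices."""
    usage = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        name = name.strip()
        if not name.startswith(prefix):
            continue
        fields = data.split()
        usage[name] = {"rx_bytes": int(fields[0]), "tx_bytes": int(fields[8])}
    return usage


def bytes_per_second(before, after, seconds):
    """Per-interface rates from two parse_net_dev() samples taken `seconds` apart."""
    rates = {}
    for name, now in after.items():
        if name in before:
            rates[name] = {
                key: (now[key] - before[name][key]) / seconds for key in now
            }
    return rates
```

In practice the text would come from reading /proc/net/dev twice, a known number of seconds apart, and each rate would then be reported with gmetric.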
sensors_gmetric.sh collects CPU temperature and voltage on a Tyan motherboard. This is my least happy script.
disk_gmetric.sh collects the number of writes and reads per disk. It requires the 'iostat' program, from the sysstat package.