Server Monitoring Solution Using Grafana, InfluxDB and collectd
A couple of days ago, I wanted to add a couple of new nodes to the Ganglia deployment I maintained to monitor HTRC services and cluster nodes. Even though everything looked okay after installing and configuring the Ganglia monitor daemons on the new machines, I couldn’t get them to publish monitoring data to Ganglia’s gmetad. Worse, I couldn’t find any errors (I am not sure whether I looked in the right place, but I couldn’t find anything). I first tried Performance Co-Pilot (PCP) with Netflix Vector, but couldn’t figure out how to set up a central metric collection server. Even though the PCP and Vector combination looked really great, having to type the node’s host name every time I wanted to monitor a server was not what I wanted.
So I decided to give the combination of Grafana, InfluxDB and collectd a try. I was able to get this setup working within a couple of hours, with several dashboards for a subset of servers. Below is a screenshot of one of the dashboards.
In this post, I am going to discuss how to get these three tools working together as a scalable and flexible monitoring solution for a small cluster of nodes.
First of all, you have to install the latest InfluxDB on one of your nodes and configure the firewall to open the InfluxDB admin console port and the InfluxDB back-end port. You can find installation instructions here. The next thing is to install collectd on one of your nodes. If you are using Ubuntu, this is a good document on how to install and configure collectd.
Your next task is to publish the stats collected by the collectd daemons to the InfluxDB instance you just deployed. Some time back, we had to use a collectd-to-InfluxDB proxy to get this done, but since version 0.8.4 InfluxDB supports the collectd protocol natively. To make this setup work, you have to enable the collectd input plugin for InfluxDB by adding the following configuration to the input_plugins section of the InfluxDB configuration file, /opt/influxdb/shared/config.toml.
[input_plugins.collectd]
enabled = true
port = 8096
database = "collectd"
typesdb = "/usr/share/collectd/types.db"
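To give a sense of what the file behind the typesdb setting contains, here is a minimal sketch that parses types.db-style lines. It assumes the layout described in collectd's types.db documentation (a type name followed by name:type:min:max data sources, separated by spaces and optional commas). The function name parse_types_db and the sample lines are illustrative only and are not part of collectd or InfluxDB.

```python
def parse_types_db(lines):
    """Map each type name to a list of (ds_name, ds_type, ds_min, ds_max)."""
    types = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # a type name without any data sources
        name, spec = parts
        sources = []
        # Data sources are separated by whitespace, optionally after a comma.
        for field in spec.replace(",", " ").split():
            ds_name, ds_type, ds_min, ds_max = field.split(":")
            sources.append((ds_name, ds_type, ds_min, ds_max))
        types[name] = sources
    return types


# Illustrative sample lines (assumed values, not copied from a real types.db).
sample = [
    "# name      data sources",
    "load        shortterm:GAUGE:0:5000, midterm:GAUGE:0:5000, longterm:GAUGE:0:5000",
    "if_octets   rx:DERIVE:0:U, tx:DERIVE:0:U",
]
for type_name, data_sources in parse_types_db(sample).items():
    print(type_name, [ds[0] for ds in data_sources])
```

To inspect a real file, you could pass it an open file handle, for example parse_types_db(open("/usr/share/collectd/types.db")).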
The typesdb file contains collectd's type definitions, which are needed to interpret the data received from collectd. You can copy this file from one of your servers running collectd, or you can install collectd on the server where you are running InfluxDB. The example above shows the scenario where collectd is installed on the same server as InfluxDB. You can use whatever port you like in the above configuration. Restart InfluxDB once the necessary configuration is done. Also make sure to create a database called collectd in InfluxDB, via the admin web UI or using the REST API as shown below.
curl -X POST 'http://influxdb-host-name:8086/db?u=root&p=root' \
-d '{"name": "collectd"}'
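If you would rather script this step, the same request can be sketched in Python with only the standard library. This mirrors the curl command above and nothing more: the helper names build_create_db_request and create_database are made up for this example, and the root/root credentials and port 8086 are simply the values used in this post.

```python
import json
import urllib.parse
import urllib.request


def build_create_db_request(host, name, user="root", password="root", port=8086):
    """Build (but do not send) the POST request that creates a database."""
    query = urllib.parse.urlencode({"u": user, "p": password})
    url = f"http://{host}:{port}/db?{query}"
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def create_database(host, name, **kwargs):
    """Send the request to a live server and return the HTTP status code."""
    request = build_create_db_request(host, name, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Only builds the request; call create_database(...) against a real server.
# "influxdb-host-name" is the placeholder host name used in this post.
req = build_create_db_request("influxdb-host-name", "collectd")
print(req.get_method(), req.full_url)
```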
The next step is to configure collectd to send data directly to InfluxDB instead of storing it as RRD files. In the collectd configuration file (/etc/collectd/collectd.conf on Ubuntu and /etc/collectd.conf on RedHat EL), enable the network plugin and configure it to send metrics to InfluxDB as shown below.
LoadPlugin network
...
<Plugin "network">
Server "influxdb-host-name" "8096"
</Plugin>
You can also disable the rrdtool plugin, because you no longer need it (we are sending metrics to InfluxDB). Restart collectd to make the configuration changes take effect.
Next, you can go to the InfluxDB web UI and list the time series you have in your database (using the ‘list series’ query). If data is arriving from collectd, you should see something like the listing below.
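As an alternative to the web UI, the same ‘list series’ query can be sent over HTTP. The sketch below assumes the 0.8-style API used by the curl command earlier in this post, where queries go to /db/<database>/series with the query text in the q parameter. The helper names build_query_url and run_query are hypothetical, and the response is returned as raw decoded JSON because its exact shape may differ between InfluxDB versions.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(host, database, query, user="root", password="root", port=8086):
    """Return the URL that runs `query` against `database`."""
    params = urllib.parse.urlencode({"u": user, "p": password, "q": query})
    db = urllib.parse.quote(database)
    return f"http://{host}:{port}/db/{db}/series?{params}"


def run_query(host, database, query, **kwargs):
    """Send the query to a live server and return the decoded JSON response."""
    url = build_query_url(host, database, query, **kwargs)
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the URL is built here; call run_query(...) against a real server.
print(build_query_url("influxdb-host-name", "collectd", "list series"))
```

A non-empty result from run_query("influxdb-host-name", "collectd", "list series") would suggest that collectd data is reaching the database.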
The final step is to deploy Grafana somewhere and create some dashboard widgets to visualize the time series data you are interested in, as I have shown in the first image. More information on how to configure an InfluxDB data source for Grafana can be found here.