Nagios Doesn’t Suck (As much as people think)

My predecessor at my current company used a platform called check_mk to monitor our network. Unfortunately, check_mk populates its checks through network discovery, which can make it very chatty. It is also convoluted: the instance I inherited ran on top of Icinga, which is itself a fork of Nagios. When making changes, there were layers and layers of configuration files to dig through, at least in the check_mk instance my predecessor had bequeathed me. Needless to say, I was not a fan, and it wasn’t very efficient. I understand why Icinga was forked from Nagios. At the time, Nagios was stagnant. Since then, I feel like the Nagios camp has progressed significantly. I also understand why people layer check_mk on top of it, but it’s not for me. For the granularity I want in monitoring, check_mk would be more labor-intensive than Nagios.

We had an unfortunate, but serendipitous, outage that destroyed our check_mk server. This “forced” me to build a new monitoring platform from the ground up, as backups had never been configured for that specific server. At previous companies, I have used the gamut of monitoring platforms: everything from Icinga to check_mk, AppFirst (ScienceLogic) to What’s Up Gold. While I know that it receives a lot of shade from the tech community at large, I chose Nagios. I did this for three reasons. First, it allows me to granularly monitor what I want, how I want. Second, it’s very easy to configure. Third, it integrates cleanly with Slack, which we use heavily at the moment, to automate notifications. I’m using some scripts that I picked up from the GitHub page of RunLevel Consulting. The only dependency is curl, which literally every Linux instance should have installed anyway. If you have any issues with the code, feel free to reach out to those folks. They were extraordinarily helpful when I was debugging some issues.
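For anyone curious what a Slack notification script boils down to, here is a minimal sketch in Python. It is not the RunLevel Consulting code (those scripts use curl); it only illustrates the general idea of turning Nagios notification macros such as $HOSTNAME$, $SERVICEDESC$, $SERVICESTATE$, and $SERVICEOUTPUT$ into a JSON message for a Slack incoming webhook. The function names, the emoji mapping, and the message layout are all assumptions made for illustration.

```python
import json
import urllib.request

# Hypothetical mapping from Nagios service states to Slack emoji.
STATE_ICONS = {
    "OK": ":white_check_mark:",
    "WARNING": ":warning:",
    "CRITICAL": ":red_circle:",
    "UNKNOWN": ":grey_question:",
}


def build_payload(host, service, state, output):
    """Build the JSON body for a Slack incoming webhook.

    The arguments are expected to come from Nagios macros passed on the
    command line by a notification command definition.
    """
    icon = STATE_ICONS.get(state, ":grey_question:")
    return {"text": f"{icon} {state}: {service} on {host} - {output}"}


def send_to_slack(webhook_url, payload, timeout=10):
    """POST the payload to the webhook and return the HTTP status code.

    webhook_url is the incoming webhook URL generated in your own
    Slack workspace; none is assumed here.
    """
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=data,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

In a real setup, a Nagios notification command would call a small wrapper around these two functions and pass the macros as arguments. The curl-based scripts do the same job with one fewer moving part, which is a big part of their appeal.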

As is the standard in my company now, we use Devuan Linux; we don’t want any of that nasty systemd malware in our midst. I built a basic Devuan VM and installed Nagios from the repository. I’m still on Devuan Jessie, so the packaged version was a little aged at Nagios Core 3.5.1. Regardless, everything worked out of the box. I populated my cfg files with my servers, network devices, and hostgroups, slapped in the scripts to get Slack alerting active, and I was off to the races. At this point, I was only getting up/down, ssh, and http statuses, but it was a good start. After that, I configured NRPE on all of our production servers for disk space, service monitoring, memory usage, and load. The previously configured Slack alerts worked flawlessly with the NRPE checks.

Obviously, it’s going to take a few weeks of fine tuning, tweaking, and wrinkle ironing to get it perfect. However, it’s a fairly painless process to get it installed, configured, and alerting. If you’re looking for a monitoring platform for your business, don’t overlook Nagios just because it’s “old” or “not pretty.” Nagios has a very stable code base, it’s in nearly every Linux distribution repository, and any systems administrator or engineer worth their salt has experience with it. It’s a reliable, boring tool for a boring, but crucial, task.
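Part of why this stays boring is that a Nagios check, whether run locally or through NRPE, is just a program that prints one line of status and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below illustrates that convention with a toy disk space check in Python. It is not the stock check_disk plugin and not what is running on the servers described above; the function names and the 80/90 percent thresholds are assumptions chosen for the example.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_pct, warn_pct, crit_pct):
    """Map a usage percentage to a Nagios state, checking critical first."""
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def check_disk(path="/", warn_pct=80.0, crit_pct=90.0):
    """Return (exit_code, status_line) for the filesystem holding path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        # A path that cannot be read is reported as UNKNOWN, not CRITICAL.
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {err}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero total size"
    used_pct = 100.0 * usage.used / usage.total
    code = classify(used_pct, warn_pct, crit_pct)
    return code, f"DISK {LABELS[code]} - {path} is {used_pct:.1f}% full"
```

To turn this into an actual plugin, a wrapper script would print the status line and call sys.exit() with the returned code. In practice the stock plugins already cover disk, load, and most common services, so a custom check like this is only worth writing for something they do not handle.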

