What’s Wrong With Nagios?

Don’t get me wrong, I like Nagios. I think it’s an excellent piece of software and I have spent many years working with it, but I have just completed a proof of concept and gained approval to deploy OpenNMS as a new Enterprise Grade Network Monitoring System. And the main system targeted for replacement here? That’s right, it’s Nagios, which is primarily running via the remote plugin model, using the NRPE daemon to run scripts on remote hosts and report back to base.

Now anyone who has ever played with Nagios will know that it can be a beast of a thing to set up and get working satisfactorily. In fact, most places will devote a good year or so to the process. As a newbie, sitting in front of a freshly installed Nagios instance and wondering how to get it to do something can be an extremely disheartening experience. Once it’s up and running though it’s usually fairly low maintenance to keep it going, and not too difficult to add new devices or custom plugins as you go along. And, for the most part, it is good at what it does, so why would you want to replace it?

Well, despite having almost unparalleled abilities to monitor at the application level and perform any manner of esoteric checks, Nagios does have its limitations.

A Question of Scale

One of the biggest problems I have encountered with various Nagios implementations is one of scale. Put simply, Nagios does not scale well.

Too Much Information, Too Little Visibility

I have seen Nagios implementations monitoring hundreds or even thousands of hosts and services where the corresponding Host Detail and Service Detail screens are simply so big that they refresh themselves before you can scroll even halfway down the page.

The Tactical Overview page gives you a simple view into the number of current issues, but doesn’t tell you at a glance what or where they are.

This makes using it in a NOC something of a chore as you actually have to interact with it to get at the information you require. It also has fairly poor visibility into historical data, although this can be addressed to some extent using additional plugins such as perfparse, and it has little to no reporting output - both of which are things The Business tend to like rather a lot.

Timeout, I Tripped Myself Up

Running custom plugins to fit any ad-hoc monitoring requirement might seem like a good idea, as you have total control over the requirements and the output. For what it’s worth, I like writing Nagios plugins; I’ve written them for the NRPE daemon as well as for places where the plugins are installed locally and run over ssh.
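For anyone who hasn't written one, a plugin is really just a program that prints a line of status text and exits with a code Nagios understands: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. Here's a minimal sketch in Python. The disk check, the function names and the thresholds are my own illustrative assumptions, not anything lifted from a real deployment.

```python
"""Minimal sketch of a Nagios-style plugin (illustrative, not production code)."""
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage onto an exit code and a one-line status message.

    The warn/crit defaults are hypothetical thresholds chosen for the example.
    """
    if used_pct >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_pct:.1f}% used"
    if used_pct >= warn:
        return WARNING, f"DISK WARNING - {used_pct:.1f}% used"
    return OK, f"DISK OK - {used_pct:.1f}% used"


def main(path="/"):
    """Check one filesystem, print the status line and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, message = evaluate(100.0 * usage.used / usage.total)
    print(message)
    return code

# As a real plugin you would finish the script with: sys.exit(main())
```

The important part is the split: one function decides the state, and the script's exit code carries that state back to Nagios (or to NRPE, which relays it).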

In both instances I have seen a single poll run take longer than the time available to gather the results of that poll, and have seen systems brought to their knees as a result.
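One way to read that problem is that the checks simply can't all finish before the next round is due. To put rough numbers on it, here's a back-of-the-envelope sketch. It assumes checks run a fixed number at a time and that every check takes about the same time, which is a simplification; the figures are made up for illustration and aren't measurements from any system I've worked on.

```python
import math


def poll_run_seconds(num_checks, seconds_per_check, max_parallel):
    """Estimate wall-clock time for one full poll run.

    Assumes checks are executed in batches of max_parallel and that each
    check takes roughly seconds_per_check.
    """
    batches = math.ceil(num_checks / max_parallel)
    return batches * seconds_per_check


def keeps_up(num_checks, seconds_per_check, max_parallel, interval_seconds):
    """True if a poll run finishes before the next one is due to start."""
    run_time = poll_run_seconds(num_checks, seconds_per_check, max_parallel)
    return run_time <= interval_seconds


# Hypothetical figures: 3,000 checks at 5 seconds each, 20 at a time,
# polled every 5 minutes (300 seconds).
print(poll_run_seconds(3000, 5, 20))  # 750
print(keeps_up(3000, 5, 20, 300))     # False
```

With those numbers a run needs 750 seconds but a new one is due every 300, so the backlog only ever grows.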

Please Invent Me a Wheel

From my experience this is probably the most misunderstood issue with Nagios; people will spend a long time writing all manner of shell scripts or Perl scripts to plug in to Nagios to return all manner of incredibly useful data, which is all well and good, but most of that information is available already, at significantly less cost (both computational and time), from SNMP.

Yes, Nagios is perfectly capable of polling SNMP, it’s just that I’ve not yet come across anyone who was using it that way by default, and once you have the system set up with dozens or even hundreds of plugins, making the choice to convert to SNMP would be an administrative nightmare.
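To give a flavour of what's already there, here's a small sketch that maps a few typical hand-rolled checks onto standard OIDs, plus a helper that builds the equivalent Net-SNMP snmpget command line. The check labels, the hostname and the community string are my own placeholders. The OIDs come from MIB-II, the Host Resources MIB and the UCD-SNMP MIB, but do verify them against your own agents before relying on them.

```python
# A few things people often script by hand, with a standard OID that
# already exposes the same information. The labels are illustrative only.
SNMP_EQUIVALENTS = {
    "check_uptime": "1.3.6.1.2.1.1.3.0",             # sysUpTime.0 (MIB-II)
    "check_procs": "1.3.6.1.2.1.25.1.6.0",           # hrSystemProcesses.0 (Host Resources MIB)
    "check_load_1min": "1.3.6.1.4.1.2021.10.1.3.1",  # laLoad.1 (UCD-SNMP MIB)
    "check_free_mem": "1.3.6.1.4.1.2021.4.6.0",      # memAvailReal.0 (UCD-SNMP MIB)
}


def snmpget_command(check, host, community="public"):
    """Build the Net-SNMP snmpget command line for one of the checks above."""
    oid = SNMP_EQUIVALENTS[check]
    return ["snmpget", "-v", "2c", "-c", community, host, oid]


# "server01.example.com" is a placeholder hostname.
print(" ".join(snmpget_command("check_load_1min", "server01.example.com")))
```

Nothing has to be installed on the target beyond the SNMP agent itself, which is the whole point.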

Security?

And here’s the big one…

OK, so Nagios allows you to write plugins in just about any language and run them on remote servers. Hands up anyone who sees the problem here? Yes, I trust myself and the integrity of my team to be able to write safe plugins that won’t wipe out the remote host when they are run, but what about somebody who worked here years ago whom I never met? Or some new recruit who may come along after I’m gone? Should I trust their code? Should they be allowed to run ad-hoc scripts on just about any production server they feel like with no checks or balances? Because that’s what will happen. With NRPE, most people bundle all the plugins up and install them on every target system rather than picking and choosing which ones are required, so there is the potential to put a script that has never been tested onto a system it was never designed for, and then go ahead and run it anyway. Now I don’t know about you, but that strikes me as a recipe for disaster…

And I won’t even mention the time I saw two plugins written in PHP amongst the usual mix of Perl, shell scripts or binaries. What’s so wrong with that? I hear you ask. Only that it then meant that all of the PHP libraries had to be installed just so those two plugins could run. That included all libraries developed in-house, not just the distribution ones.
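If you want to see how much of that bundle a given host actually needs, a rough audit isn't hard. The sketch below assumes the usual NRPE convention of defining checks as command[name]=/path/to/plugin args in nrpe.cfg and compares that against a listing of the plugin directory. The sample config, paths and plugin names are made up for illustration.

```python
import os
import re

# NRPE command definitions look like: command[check_disk]=/path/to/plugin args
COMMAND_RE = re.compile(r"^\s*command\[([^\]]+)\]\s*=\s*(\S+)")


def referenced_plugins(nrpe_cfg_text):
    """Return the basenames of plugins that the NRPE config actually calls."""
    names = set()
    for line in nrpe_cfg_text.splitlines():
        match = COMMAND_RE.match(line)
        if match:
            names.add(os.path.basename(match.group(2)))
    return names


def unreferenced_plugins(installed, nrpe_cfg_text):
    """Plugins present on the host that no command definition uses."""
    return sorted(set(installed) - referenced_plugins(nrpe_cfg_text))


# Hypothetical config and plugin directory listing, for illustration only.
SAMPLE_CFG = """
# comment lines are ignored
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_load]=/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
"""
INSTALLED = ["check_disk", "check_load", "check_legacy_db.php", "check_old_raid.sh"]

print(unreferenced_plugins(INSTALLED, SAMPLE_CFG))
```

It only looks at the first word of each command line, so anything launched through sudo or a wrapper script will show up under the wrapper's name rather than the plugin's. Treat the result as a list of things to go and ask questions about, not a list of things to delete.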

It’s Not Exactly Broken, But Please Fix It

Obviously for most organisations, setting up something like Nagios represents a significant investment. Remember, there is no such thing as ‘Free’ software: you still have to pay for the time it takes to set up and maintain it, so you have to have some pretty good reasons to want to replace it.

The main driving factor for doing so here was to provide better visibility into the Operational Infrastructure. This meant not only a clearer interface or dashboard showing at a glance where there are any issues or outages, but also better historical trending information and better reporting.

Actually sourcing a replacement system to fulfil the Business Requirements is no mean feat, but at the end of a five-month project, OpenNMS was chosen as the best solution. I won’t go into the process here, but suffice it to say it had some pretty stiff competition, especially from the likes of Zenoss and Hyperic HQ.

OpenNMS vs Nagios

So, when it comes down to it, why choose OpenNMS over Nagios?

At the end of the day the differences can be brought down to just three points:

  1. Visibility
  2. Reporting
  3. Scalability

There are other important factors - like auto discovery for instance, which Nagios doesn’t do and which OpenNMS makes incredibly easy: with a few clicks through the GUI you can start monitoring your entire network and collecting data with almost no effort. Obviously you will want more from the system than this; you’ll want to set alerts and thresholds for instance. But at least, unlike Nagios, it is very easy to make it start to do something useful.

In terms of the Business Requirements though, as expressed by the three items listed above, OpenNMS has these in spades; you can create multiple customised dashboards, there is a wealth of out of the box reporting functionality as well as customised report creation, and there is the potential for huge scalability, even running a distributed model across multiple servers or locations. But that’s not enough on its own to sell the idea to Management. What makes OpenNMS a better choice for the enterprise?

OpenNMS describes itself as:

the world’s first enterprise grade network management platform developed under the open source model.

But that doesn’t mean much when it comes to selling a solution to Management, who tend to want to know about things like TCO, ROI, and other such important factors. With commercial propositions it is possible to make these kinds of calculations. With an Open Source product this is much more of a grey area. What is it that makes OpenNMS a better proposition than, say, HP’s OpenView? Yes, the software is free, but what is the cost involved to set it up and maintain it? Obviously most Linux Sys Admins are capable of picking up just about any Open Source product and running with it. It may just take longer to get your head around some systems and require more TLC to get them working just how you want them - see my comments above about Nagios, and the exact same thing can be said about OpenNMS.

Sometimes, though, you don’t have the luxury of time, so the question then comes down to support. And it just so happens that the makers of OpenNMS have a commercial support company in the shape of The OpenNMS Group, which exists to provide various levels of support agreements and professional services.

There’s Service, and Then There’s Service

If your entire business model is based around the support you can offer for an otherwise free product, and the competition includes vendor provided solutions where they’ll hold your hand all the way and do everything for you (at the cost of an arm and a leg, while we’re at it), you’d better be able to provide something pretty special.

And here’s the kicker - the guys from the OpenNMS Group are the guys who actually write and maintain the software. Who would you rather ask your technical questions to? Some slick salesman with a glossy brochure and, if you’re lucky, a service manual, or the guy who actually wrote that particular module or function? Ask Management that question and you know which answer you’re going to get, but ask the techies, the people who have to work with these systems, and they’ll go for the techies and the developers every time.

In the course of my investigations into OpenNMS I’ve spoken to quite a few people, and seen any number of online sources, all of whom are happy to sing the praises of the OpenNMS Group and the exemplary commitment to service and support they show, and now I have to take a moment to add my praises too.

Even after OpenNMS was chosen as the best fit for our requirements there was still significant reluctance to sign off on the project and actually move towards deployment. There were varying levels of concern about the scale, scope and capabilities of the product, even in the face of all the evidence I had produced over the previous couple of months, with the biggest remaining issue being the ability to accurately replicate what was currently being monitored by Nagios. By my estimate, approximately 90% of the Nagios functionality could be reproduced out of the box by SNMP, and the rest could be mirrored either by running the Nagios plugins from within OpenNMS or by using some other native means - with a bit of work, obviously. It didn’t matter how many times I said this though, or how much evidence I produced; the people who would actually have to work with the system wanted further reassurances.

With just one day to go before we gave up and pulled the plug on the project, the maintainer of the OpenNMS project, Tarus Balog (The Mouth of OpenNMS), very kindly agreed to a teleconference with us to address some of the concerns being voiced by the Infrastructure guys. Since we’re in Australia he even took time out of his evening for the purpose. After he had spent over an hour on the phone with us, fielding all of the questions which were put to him, not only did I manage to get the project signed off straight away, I managed to get agreement to engage the OpenNMS Group - in fact Tarus himself - to come to Australia and perform an Enterprise GreenLight Deployment.

Coincidentally that very same day I read a post on Tarus’ blog about the need to improve their marketing strategy. To the people that matter, I’d like to state that the friendliness and helpfulness of the OpenNMS guys is probably their single biggest asset. After talking to Tarus, their dedication to their cause is obvious, and it certainly helped to sell the product as far as the people who sign the paperwork here were concerned. From a techie point of view I have to say their marketing already works just fine, but hey, anything which helps promote these guys and the amazing levels of service they provide is alright with me.

Disclaimer: I am a Technical Specialist and Hired Gun. I have no affiliation with the OpenNMS Group other than through implementing and using their software.
