Wed 2 Jul 2008
OpenNMS vs Nagios
Posted by Craig under Tech
What’s Wrong With Nagios?
Don’t get me wrong, I like Nagios. I think it’s an excellent piece of software and I have spent many years working with it, but I have just completed a proof of concept and gained approval to deploy OpenNMS as a new Enterprise Grade Network Monitoring System. And the main system targeted for replacement here? That’s right, it’s Nagios, which is primarily running via the remote plugin model, using the NRPE daemon to run scripts on remote hosts and report back to base.
Now anyone who has ever played with Nagios will know that it can be a beast of a thing to set up and get working satisfactorily. In fact, most places will devote a good year or so to the process. As a newbie, sitting in front of a freshly installed Nagios instance and wondering how to get it to do something can be an extremely disheartening experience. Once it’s up and running though it’s usually fairly low maintenance to keep it going, and not too difficult to add new devices or custom plugins as you go along. And, for the most part, it is good at what it does, so why would you want to replace it?
Well, despite having almost unparalleled abilities to monitor at the application level and perform any manner of esoteric checks, Nagios does have its limitations.
A Question of Scale
One of the biggest problems I have encountered with various Nagios implementations is one of scale. Put simply, Nagios does not scale well.
Too Much Information, Too Little Visibility
I have seen Nagios implementations monitoring hundreds or even thousands of hosts and services where the corresponding Host Detail and Service Detail screens are simply so big that they refresh themselves before you can scroll even half way down the page.
The Tactical Overview page gives you a simple view into the number of current issues, but doesn’t tell you at a glance what or where they are.
This makes using it in a NOC something of a chore as you actually have to interact with it to get at the information you require. It also has fairly poor visibility into historical data, although this can be addressed to some extent using additional plugins such as perfparse, and it has little to no reporting output - both of which are things The Business tend to like rather a lot.
Timeout, I Tripped Myself Up
Running custom plugins to fit any ad-hoc monitoring requirement might seem like a good idea, as you have total control over the requirements and the output. For what it’s worth, I like writing Nagios plugins; I’ve written them for the NRPE daemon as well as for places where the plugins are installed locally and run over ssh.
In both instances I have seen a single poll run take longer than the time available to gather its results, and I have seen systems brought to their knees as a result.
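One way to read the problem above: if the checks in a poll run take longer, in total, than the interval between runs, the next run starts before the last one has finished and the backlog only grows. The Python sketch below just does that arithmetic. It is not how Nagios schedules checks; the function names, the worker count and the example figures are all assumptions made for illustration.

```python
def poll_run_seconds(check_durations, workers=1):
    """Rough lower bound on the wall-clock time for one poll run.

    Assumption: checks are spread evenly over `workers` parallel slots.
    Real schedulers are less tidy than this.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not check_durations:
        return 0.0
    # A run cannot finish faster than its slowest check, nor faster than
    # the total work divided across the available slots.
    return max(max(check_durations), sum(check_durations) / workers)


def overruns(check_durations, interval_seconds, workers=1):
    """True if one poll run cannot complete within the polling interval."""
    return poll_run_seconds(check_durations, workers) > interval_seconds


# Hypothetical figures: 2,000 remote checks at 3 seconds each,
# run 10 at a time, polled every 5 minutes (300 seconds).
durations = [3.0] * 2000
print(poll_run_seconds(durations, workers=10))  # 600.0
print(overruns(durations, 300, workers=10))     # True
```

With these made-up numbers a run needs at least ten minutes against a five minute interval, so every run starts further behind than the one before it.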
Please Invent Me a Wheel
From my experience this is probably the most misunderstood issue with Nagios; people will spend a long time writing all manner of shell scripts or Perl scripts to plug in to Nagios to return all manner of incredibly useful data, which is all well and good, but most of that information is available already, at significantly less cost (both computational and time), from SNMP.
Yes, Nagios is perfectly capable of polling SNMP, it’s just that I’ve not yet come across anyone who was using it that way by default, and once you have the system set up with dozens or even hundreds of plugins, making the choice to convert to SNMP would be an administrative nightmare.
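As a rough illustration of the point above, here is a minimal Python sketch of a load check that asks the SNMP agent for a value instead of running a custom script on the remote host. It assumes the Net-SNMP `snmpget` command is installed on the monitoring server and that the target runs an agent exposing UCD-SNMP-MIB. The thresholds, the community string and the function names are assumptions of mine, not anything Nagios or OpenNMS ships with.

```python
import subprocess

# Nagios plugin exit codes (the standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# UCD-SNMP-MIB laLoad.1: the 1-minute load average published by Net-SNMP agents.
LOAD_1MIN_OID = ".1.3.6.1.4.1.2021.10.1.3.1"


def snmpget_command(host, oid, community="public"):
    """Build a Net-SNMP `snmpget` command; -Oqv asks for the bare value only."""
    return ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid]


def parse_load(raw):
    """Turn snmpget's text output into a float, tolerating quotes and whitespace."""
    return float(raw.strip().strip('"'))


def evaluate_load(load, warn=4.0, crit=8.0):
    """Map a load value onto a Nagios-style (exit code, message) pair.

    The warn/crit thresholds are illustrative assumptions.
    """
    if load >= crit:
        return CRITICAL, f"CRITICAL - load {load:.2f}"
    if load >= warn:
        return WARNING, f"WARNING - load {load:.2f}"
    return OK, f"OK - load {load:.2f}"


def check_load(host, community="public"):
    """Poll the agent and evaluate the result; any failure is reported as UNKNOWN."""
    try:
        result = subprocess.run(
            snmpget_command(host, LOAD_1MIN_OID, community),
            capture_output=True, text=True, timeout=10, check=True,
        )
        return evaluate_load(parse_load(result.stdout))
    except (OSError, subprocess.SubprocessError, ValueError) as exc:
        return UNKNOWN, f"UNKNOWN - {exc}"
```

Used as a plugin, you would call `check_load()` with the target's hostname, print the message and exit with the returned code. Nothing has to be installed on the target beyond the SNMP agent it probably already runs.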
Security?
And here’s the big one…
Ok, so Nagios allows you to write plugins in just about any language and run them on remote servers. Hands up anyone who sees the problem here? Yes, I trust myself and the integrity of my team to be able to write safe plugins that won’t wipe out the remote host when they are run, but what about somebody who worked here years ago whom I never met? Or some new recruit who may come along after I’m gone? Should I trust their code? Should they be allowed to run ad-hoc scripts on just about any production server they feel like with no checks or balances? Because that’s what will happen. If you are using NRPE then most people bundle all the plugins up and install them on each target system rather than picking and choosing which ones are required, so there is the potential to put a script that has never been tested on a system it was never designed to be run on, and then go ahead and run it anyway. Now I don’t know about you, but that strikes me as a recipe for disaster…
And I won’t even mention the time I saw two plugins written in PHP amongst the usual mix of Perl, shell scripts or binaries. What’s so wrong with that? I hear you ask. Only that it then meant that all of the PHP libraries had to be installed just so those two plugins could run. That included all libraries developed in-house, not just the distribution ones.
It’s Not Exactly Broken, But Please Fix It
Obviously for most organisations, setting up something like Nagios represents a significant investment. Remember, there is no such thing as ‘Free’ software; you still have to pay for the time it takes to set up and maintain it, so you need some pretty good reasons to want to replace it.
The main driving factor for doing so here was to provide better visibility into the Operational Infrastructure. This meant not only a clearer interface or dashboard showing at a glance where there are any issues or outages, but also better historical trending information and better reporting.
Actually sourcing a replacement system to fulfil the Business Requirements is no mean feat, but at the end of a 5 month project, OpenNMS was chosen as the best solution. I won’t go into the process here, but suffice it to say it had some pretty stiff competition, especially from the likes of Zenoss and Hyperic HQ.
OpenNMS vs Nagios
So, when it comes down to it, why choose OpenNMS over Nagios?
At the end of the day the differences can be brought down to just three points:
- Visibility
- Reporting
- Scalability
There are other important factors - like auto discovery for instance, which Nagios doesn’t do and which OpenNMS makes incredibly easy: with a few clicks through the GUI you can start monitoring your entire network and collecting data with almost no effort. Obviously you will want more from the system than this; you’ll want to set alerts and thresholds, for instance. But at least, unlike Nagios, it is very easy to make it start to do something useful.
In terms of the Business Requirements though, as expressed by the three items listed above, OpenNMS has these in spades; you can create multiple customised dashboards, there is a wealth of out of the box reporting functionality as well as customised report creation, and there is the potential for huge scalability, even running a distributed model across multiple servers or locations. But that’s not enough on its own to sell the idea to Management. What makes OpenNMS a better choice for the enterprise?
OpenNMS describes itself as:
the world’s first enterprise grade network management platform developed under the open source model.
But that doesn’t mean much when it comes to selling a solution to Management, who tend to want to know about things like TCO, ROI, and other such important factors. With commercial propositions it is possible to make these kinds of calculations. With an Open Source product this is much more of a grey area. What is it that makes OpenNMS a better proposition than, say, HP’s OpenView? Yes, the software is free, but what is the cost involved to set it up and maintain it? Obviously most Linux Sys Admins are capable of picking up just about any Open Source product and running with it; it may just take longer to get your head around some systems and require more TLC to get them working just how you want them. See my comments above about Nagios; the exact same thing can be said about OpenNMS.
Sometimes, though, you don’t have the luxury of time, so the question then comes down to support. And it just so happens that the makers of OpenNMS have a commercial support company in the shape of The OpenNMS Group, which exists to provide various levels of support agreements and professional services.
There’s Service, and Then There’s Service
If your entire business model is based around the support you can offer for an otherwise free product, and when the competition includes vendor provided solutions where they’ll hold your hand all the way and do everything for you (at the cost of an arm and a leg, while we’re at it), you better be able to provide something pretty special.
And here’s the kicker - the guys from the OpenNMS Group are the guys who actually write and maintain the software. Who would you rather ask your technical questions to? Some slick salesman with a glossy brochure and, if you’re lucky, a service manual, or the guy who actually wrote that particular module or function? Ask Management that question and you know which answer you’re going to get, but ask the techies, the people who have to work with these systems, and they’ll go for the techies and the developers every time.
In the course of my investigations into OpenNMS I’ve spoken to quite a few people, and seen any number of online sources, all of whom are happy to sing the praises of the OpenNMS Group and the exemplary commitment to service and support they show, and now I have to take a moment to add my praises too.
Even after OpenNMS was chosen as the best fit for our requirements there was still significant reluctance to sign off on the project and actually move towards deployment. There were varying levels of concerns about the scale, scope and capabilities of the product, even in the face of all the evidence I had produced over the previous couple of months, with the biggest remaining issue being the ability to accurately replicate what was currently being monitored by Nagios. Of course, I would estimate that approximately 90% of the Nagios functionality could be reproduced out of the box by SNMP, and the rest could be mirrored either by running the Nagios plugins from within OpenNMS or by using some other native means - with a bit of work, obviously. It didn’t matter how many times I said this though, or how much evidence I produced, the people who would actually have to work with the system wanted further reassurances.
With just one day to go before we gave up and pulled the plug on the project, the maintainer of the OpenNMS project, Tarus Balog (The Mouth of OpenNMS), very kindly agreed to a teleconference with us to address some of the concerns being voiced by the Infrastructure guys. Since we’re in Australia he even took time out of his evening for the purpose. After he spent over an hour on the phone with us, fielding all of the questions which were put to him, not only did I manage to get the project signed off straight away, I managed to get agreement to engage the OpenNMS Group - in fact Tarus himself - to come to Australia and perform an Enterprise GreenLight Deployment.
Coincidentally that very same day I read a post on Tarus’ blog about the need to improve their marketing strategy. To the people that matter, I’d like to state that the friendliness and helpfulness of the OpenNMS guys is probably their single biggest asset. After talking to Tarus, their dedication to their cause is obvious, and it certainly helped to sell the product as far as the people who sign the paperwork here were concerned. From a techie point of view I have to say their marketing already works just fine, but hey, anything which helps promote these guys and the amazing levels of service they provide is alright with me.
Disclaimer: I am a Technical Specialist and Hired Gun. I have no affiliation with the OpenNMS Group other than through implementing and using their software.
10 Responses to “ OpenNMS vs Nagios ”
Comments:
July 2nd, 2008 at 11:57 pm
Thanks a bunch, Craig! I’ve written 3 different comments in response to this and deleted them before hitting “Submit” because (ugh) I end up sounding like a sales guy.
Just wanted to know your take on OpenNMS is appreciated, and I’m glad you “get it” — this is exactly the kind of experience we’re trying to make sure all of our customers and users get.
July 3rd, 2008 at 4:37 am
I absolutely share your opinion. I have extensive experience with Nagios and now with OpenNMS. I have decided to do my open source projects with OpenNMS for the same reasons given in your post. In addition, it is really hard to maintain a large Nagios installation. In Nagios there are absolutely no procedures or processes to help you keep your monitoring up to date. All changes in the network must be entered manually. The discovery features in OpenNMS (capability scan and discovery) work really well for that. And last but not least, Nagios can do absolutely nothing with external commands like SNMP traps and syslog messages. The implementations with snmptt and so on –> !!! ROFL !!!
The same goes for notifications –> the Netways implementation for notifying on SNMP traps –> ROFL too
Look at http://www.netways.de/uploads/media/Martin.Fuerstenau_SNMP.Traphandling.fuer.Nagios.pdf
then you’ll see why this sucks
July 12th, 2008 at 10:53 pm
I absolutely agree with your comments about the OpenNMS group.
I’m hoping to have the pleasure to collaborate with them on an interesting project. They have worked enormously hard to deliver a great software product and develop their expertise.
We’ve chosen to work with OpenNMS and invest in further development because of their product - and their quality service. Well done Tarus, Jeff, David and the rest of the crew.
Of course, a little marketing polish wouldn’t hurt - as long as you understand it to mean -> engaging in a meaningful two-way dialog with everyone involved in the process - budget holders, managers, engineers.
July 13th, 2008 at 12:11 pm
I believe the open source community needs to re-focus the way it engages its audience (its market, in commercial terms).
We need businesses to support us. To have faith & confidence in us as a community of experts - not as a bunch of geeks with a grudge.
We need to polish our messages - not to go “corporate”, but talk to them as peers. To engage in a dialog that they understand and feel welcomed by.
With the support of the non-technical sides of the business we get more involved. Our status should be that of trusted advisors who are *really* looking out for their interests - not out to fleece them via quasi-legal lock-ins.
That requires us all to invest in learning how to present ourselves. A “commercial” & a “community” website alongside materials, webinars, demos and fully-enabled professional partner programmes to support the delivery of quality services - globally.
We need to compete head-to-head with HP, BMC, CA, EMC, IBM and the mid-tier too. We need to give those engineers that want to deploy our solutions the tools to be agents of change. They need business cases, support packages, structured professional services - they need the experience they receive from the traditional proprietary channel.
That’s a long way from where we are today - but at least the software is already there.
November 21st, 2009 at 4:49 am
Hi Craig!
I am not a techie and know very little about Nagios/ONMS etc - but after reading your blog, I thought perhaps you might have an opinion on the following.
From what I understand about ONMS (at least based upon how the older version was deployed at work here), it seems that the KPI metrics are stored by default in JRRD while newer non-KPI related data (e.g. availability management) has been stored in Postgres.
A fellow in our department was looking at a number of our environment tools recently (seems that ONMS has not been deployed across all of them so we have half a dozen in use in production today) and was considering doing a small proof of concept re: an interim solution (creating of a federated cdb to yield a common consolidated view across all these areas) until such time as an enterprise solution became affordable.
He then said that he would like to communicate with the historical data (or a subset) from all of these tools’ disparate CDBs created from disparate tools (most are Oracle odbc compliant) but that he would not be able to communicate with ONMS’s KPI data due to it being stored in JRRD.
So, I have three questions:
a) have you encountered such an issue re: JRRD and is there an easy way to export/convert the JRRD data into another format - perhaps Postgres? Or is it easier to export the desired data from JRRD to xml format first, then load it into an odbc compliant rdbms?
b) have you had much/any experience with creating an enterprise cdb to tie together disparate remote cdb data?
c) I have looked on the web at enterprise cdb solutions and came across TeamQuest’s Enterprise CDB solution. Are you aware of what other folks are doing in the industry to address this enterprise cdb issue?
Many tks in advance for whatever insight you can supply…
November 25th, 2009 at 9:00 am
Hi Steve,
To answer your question about JRRD first:
JRRD is not a data store in itself; it is a Java interface to rrdtool, which stores its data in rrd files.
I’m guessing your OpenNMS installation is pretty old as JRRD was replaced by JRobin as far back as version 1.3.2 if memory serves.
Regardless - it is possible to extract the data from RRD files as XML as you state, so you could then convert this to whatever format you require if you wanted to do something else with it. I believe there might be a few tools out there to tie RRD files into some ODBC compliant form, but generally this is regarded as unnecessary and probably more trouble than it is worth. Hopefully the reasons for this will become apparent below…
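To give a feel for what that conversion involves, here is a minimal Python sketch that flattens RRD data, once dumped to XML, into plain rows you could load anywhere. It assumes rrdtool-format files and the element layout written by `rrdtool dump` (`ds`/`name`, then `rra`/`cf` and `database`/`row`/`v`). The sample document and the function name are made up for illustration, and the per-row timestamps, which the dump carries only as XML comments, are ignored here.

```python
import xml.etree.ElementTree as ET


def rows_from_rrd_xml(xml_text):
    """Yield (consolidation function, data source name, value) for every stored cell.

    Assumes the layout produced by `rrdtool dump`; other RRD flavours may differ.
    """
    root = ET.fromstring(xml_text)
    # Only top-level <ds> elements describe the data sources.
    names = [ds.findtext("name", "").strip() for ds in root.findall("ds")]
    for rra in root.findall("rra"):
        cf = rra.findtext("cf", "").strip()
        for row in rra.findall("database/row"):
            for name, cell in zip(names, row.findall("v")):
                # Unknown samples are written as "NaN", which float() accepts.
                yield cf, name, float(cell.text.strip())


# A hand-written sample in the assumed layout, not real monitoring data.
SAMPLE = """<rrd>
  <step>300</step>
  <ds><name> ifInOctets </name><type> COUNTER </type></ds>
  <rra>
    <cf>AVERAGE</cf>
    <pdp_per_row>1</pdp_per_row>
    <database>
      <row><v>1.2500000000e+03</v></row>
      <row><v>NaN</v></row>
    </database>
  </rra>
</rrd>"""

for cf, name, value in rows_from_rrd_xml(SAMPLE):
    print(cf, name, value)
```

From there the rows can be written out as CSV or inserted into whatever database you like, although reattaching timestamps and reconciling the different archive resolutions is where the real work starts.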
It’s been a while since I was active in this area so I’m not sure what the latest developments are. I’m also not sure what you’re referring to with regard to a federated CDB or an enterprise solution.
OpenNMS is an enterprise solution. It is designed to be an enterprise grade Network Management System, so if this is what you mean by a federated CDB all I can say is just because you have to pay for something it doesn’t necessarily make it better.
Rather than spending time and money trying to find an expensive solution to pull in data from a variety of different monitoring systems, might it not be a better approach to look at consolidating your half dozen or so disparate systems into a single platform to monitor the whole network?
Having worked in a number of places which sound like yours, where various different products have been installed by a succession of different Systems Administrators, largely on a whim, and which are usually then not maintained, it quickly becomes a real headache to manage. The best approach is generally to get rid of them all and start with a clean sheet. Trying to tie them all together usually just adds even more complexity.
If the solutions you are looking at are not affordable, as you say, I strongly encourage you to spend some time looking into the capabilities of OpenNMS.
If budget is available I’d also strongly encourage you to talk to the guys at OpenNMS and see what they can do to help in terms of support.
I’m not sure if that answers your questions but I hope it’s some useful information.
November 26th, 2009 at 11:11 pm
Thanks for your feedback, Craig.
It makes a lot of sense re: an enterprise solution like ONMS across the entire infra.
Don’t know what mgt will go for re: funding, but have narrowed down our options to 3 now - i.e.
a) build a common cdb as hub to multiple disparate cdb’s in use today (so this option would be “fit for purpose and use” but likely would be a long, complex, costly and questionably supportable process given the multitude of disparate monitoring tools that would still be left in use - with their disparate cdbs to be joined into this one overall home grown solution),
b) buy (vet reqts and cost first) a vendor cdb product that is best at joining disparate cdbs - e.g. hyperformix data manager, bmc or teamquest cdb software products - this could end up being more practical as opposed to in-house development/supportability (but would still require the same work effort re: cdb requirement gathering, gap analysis, may still fall short re: filling the need, do little to align the disparate monitoring tools in place and possibly be more expensive than one common open source monitoring tool using its one cdb) or
c) buy (vet reqts and cost first) into an open source enterprise, scalable, self-discovering, visible, fully featured, “easier than most others to configure” monitoring solution like OpenNMS (great reporting capability and fully supported at a reasonable rate). I was considering whether to table this third option a few weeks ago as option #3, but was hesitant to do so due to the sheer challenge of aligning so many ingrained groups - until I read your blog yesterday. So it’s back on the table again. (I formed this option after some in-house demos of both Nagios and OpenNMS, then reading your blog comparing both).
You will likely know which one of the three I am leaning towards - the latter is the best long term (5 year?) strategy to follow - it is simpler since there is only one tool, one cdb in the end and the option may even prove to be financially attractive re: ROI etc. - however as I have implied, this third option would require an incredible amount of buy-in from all infrastructure sectors, much more time to nail monitoring, common cdb requirements - and likely more time (and thus money) to deploy.
Oh well, I will let you know which of the 3 above, if any (since option 4 would be “status quo”), mgt decides to go with.
Once again, thanks for the feedback …
November 27th, 2009 at 2:34 am
Oh yes - I forgot to ask a very important question (yet hopefully a very redundant question) - how appropriate is ONMS for overall network monitoring? I am hopeful that you will respond that its primary network management purpose is to monitor “networks” - as opposed to simply being used to monitor say just storage, or just mid range cpu?
The reason I ask is that it would seem that across our dozen different infrastructure areas - ONMS seems to be (or is currently planned to be) targeted to those areas other than the network itself (such as office automation, mid range services, mail and messaging), but not to the network itself (lan/wan/man network, firewall, vpn services, url filtering, etc.). Network seems to have a myriad of alternate tools - so before asking network why ONMS is not on their radar, I couldn’t help but wonder if ONMS had some major network feature gaps/issues.
Tks again!!!
November 27th, 2009 at 4:40 am
More specifically, I would really want to know if you think it handles the following monitoring:
- # of ports,
- CPU utilization on network devices,
- memory utilization,
- bandwidth capacity on links,
- port utilization capacity,
- application trending for traffic,
- application trending for response times and
- SLA compliance