[Ilugc] Query: Tools for monitoring servers

From: suraj@xxxxxxxxxxxxx (Suraj Kumar)
Date: Mon, 25 Jun 2012 13:05:29 +0530

On Mon, Jun 25, 2012 at 11:16 AM, K.C. Ramakrishna
<kcramakrishna at yahoo.com> wrote:

Hi all,

We are trying to look into monitoring servers in Beta and Prod environments.

The stack:
a. 2 (more in future) front-end Apache httpd/Tomcat/Liferay servers
b. Independent CAS-SSO and Solr servers.
c. Standalone server running web services (written in Java)
d. 2 MySQL (Percona) servers: 1 handles writes, both handle reads.
e. Front-end hardware load balancers.

We want to monitor continuously:
a. All the Linux boxes and the services running on them,
b. The performance (and history) of the Java applications too.

We are exploring the best ways to monitor the whole setup including alerts
and automated restarts etc. We are primarily from a development background
with only basic admin experience (basic bash, installations, tuning etc).

I have researched all the usual suspects: Cacti, Nagios, Zenoss, Zabbix, etc.
JMX seems to be very popular for tomcat/Java.

What are your recommendations for handling this scenario? What are the pros
and cons of various tools and approaches?
Please do share your thoughts on what will be a good solution. All live
examples will be very welcome in educating us.

All the tools you've mentioned above are "systems level" monitoring
applications. They are useful, easy to set up, and needed, but they
can only tell you so much about your applications, because they
measure what is outside the application (the JVM's or the OS's
internals). The application we care about is treated like a black
box. That would be fine if we had a well-behaved, well-understood
black box. But usually we don't, and even when we do, it changes.

I've found that the following general patterns of practical "DevOps"
problems occur frequently at work:

Scenario#1:
* The ops says "From <insert system monitoring tool>, we see that
network I/O has increased since last launch. We believe that
performance can increase if our application reduced excessive fetches
and instead chose to cache or preload"
* The dev says "Well, you guys seem to be running <insert operational
monitoring / management tool> on *your* machines for management. It is
not the problem of our application. We are not at fault! We have done
nothing that increases reads. We deny it all!"

(replace network I/O with anything that cannot be pinpointed)

Scenario#2:
* The ops says "From <insert system monitoring tool>, we see that
there was a CPU spike. Any idea what happened? Here are the
application's logs from the time"
* Dev says "We don't know which function/component/part caused it. We
will try to reproduce it in the lab." (and usually, no lab can be as
hairy as reality)

At the last startup I worked at (which is no longer a startup and was
serving close to half a million requests per second as of 6 months
ago), we zeroed in on mondemand for white-box metrics measurement to
tackle the frequently recurring scenarios above. See
http://mondemand.org/ . We chose LWES as the transport, but that was
based on our setup. YMMV.

mondemand requires instrumenting one's code. It also requires a
one-time investment of effort in collectively brainstorming which
metrics we want to measure and how we will measure them, keeping the
app and the business in focus. But once done, the payoff is
self-evident: the black box turns into a 'white box'.
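
To make "instrumenting one's code" concrete, here is a minimal sketch
of the idea. It is in Python for brevity, and it is NOT mondemand's
API: the Metrics class, the metric names and fetch_profile are all
hypothetical, made up for illustration. The point is only that the app
itself counts and times the things we care about (cache hits vs.
remote fetches, as in Scenario#1), so nobody has to guess from the
outside.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-process registry; not mondemand's actual API."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timed(self, name):
        # Record wall-clock time spent inside the "with" block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)

    def snapshot(self):
        """Return a flat dict that could be handed to any transport."""
        stats = dict(self.counters)
        for name, samples in self.timings.items():
            stats[name + ".count"] = len(samples)
            stats[name + ".total_seconds"] = sum(samples)
        return stats


metrics = Metrics()


def fetch_profile(user_id, cache):
    """Example application function, instrumented where it matters."""
    with metrics.timed("profile.fetch"):
        if user_id in cache:
            metrics.increment("profile.cache_hit")
            return cache[user_id]
        metrics.increment("profile.cache_miss")
        cache[user_id] = {"id": user_id}  # stand-in for a real remote fetch
        return cache[user_id]


cache = {}
fetch_profile(1, cache)
fetch_profile(1, cache)
fetch_profile(2, cache)
print(metrics.snapshot())
```

In a real setup the snapshot would be emitted periodically over
whatever transport you pick, and graphed next to the system-level
numbers.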

As for the system tools mentioned above (Zabbix, Nagios, etc.), it is
my opinion that the incremental advantage of any one over the others
may be negligible, since almost all of them are extensible (plugins /
extensions).

The distinct advantages of mondemand are:

1. When the app is Java and the system is Unix, a large gap opens up
between the "dev" and the "ops". Unix'y things like signals,
controlling I/O streams (logging), controlling priority or even
measuring memory used are difficult to achieve. Yes, JMX can help
somewhat, but JMX is a means of measuring/controlling the "Java
system" (not the app).
2. As mentioned above, white box measurement is the biggest gain.
3. The app can react to outside events, not only within the system but
also networked events (e.g., if you restarted the DB, the DB restart
"script" could emit an event and the app could "react" by
re-establishing the connection)
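
Point 3 can be sketched roughly as below. Again, this is Python for
brevity and NOT the mondemand or LWES API: EventBus, DbClient and the
"db.restarted" event name are hypothetical. The network transport is
left out entirely; the dispatcher here is in-process, whereas in
practice the event would arrive over the wire from the restart script.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical dispatcher; a real setup would receive events over the network."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self.handlers[event_name].append(handler)

    def emit(self, event_name, **attrs):
        # Deliver the event attributes to every handler registered for it.
        for handler in self.handlers[event_name]:
            handler(attrs)


class DbClient:
    """Stand-in for a database connection wrapper."""

    def __init__(self, host):
        self.host = host
        self.connect_count = 0
        self.connect()

    def connect(self):
        # A real client would open a connection or pool here.
        self.connect_count += 1

    def on_db_restarted(self, attrs):
        # Only reconnect if the event concerns the host we talk to.
        if attrs.get("host") == self.host:
            self.connect()


bus = EventBus()
db = DbClient("db1")
bus.subscribe("db.restarted", db.on_db_restarted)

bus.emit("db.restarted", host="db2")  # ignored: different host
bus.emit("db.restarted", host="db1")  # triggers a reconnect
print(db.connect_count)
```

The app reconnects because it was told about the restart, rather than
finding out from a failed query.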

Yes, there is a small performance penalty for all of this, but IMHO it is worth it. :)

cheers,

-Suraj

--
Career Gear - Industry Driven Talent Factory
http://careergear.in/
