Apr 01 2014
 

Capistrano is an invaluable automation tool, but simultaneously restarting services and hosts can play havoc with your monitoring and alerting. It’s therefore also a good idea to use Capistrano to control your monitoring. In this post I’m going to show how I do it with my Icinga installation.

The way scheduled downtime works in Icinga (these instructions apply to Nagios as well) is described here. Briefly, when downtime is scheduled for a host and/or its services, no alert notifications are sent out during the defined period. Downtime is scheduled with a start time, an end time, and a duration, and may be either fixed or flexible (fixed downtime covers the whole scheduled window, whereas flexible downtime only starts from the moment a host or service actually goes down within it).

Icinga offers a REST API to control it from the command line, but to be honest, it’s a little tricky to use and I can’t work out how to schedule downtime with it, rather than just disabling notifications. Instead, I find the command-file pipe a simpler solution – it requires access to the monitoring server itself, but that’s not really a problem.

The Icinga command-file pipe is a special file known as a FIFO that acts as a pipe into a process, in this case, the Icinga daemon itself. Whatever gets written to this file gets funnelled straight into the Icinga process. Its location can be found by checking the “command_file” option in the /etc/icinga/icinga.cfg main config file. For example:

/etc/icinga/icinga.cfg:

command_file=/var/spool/icinga/cmd/icinga.cmd
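
Since the command file is just a named pipe, the mechanism is easy to demonstrate with a throwaway FIFO. This is only a sketch using a temporary pipe, not Icinga’s own:

```shell
# A FIFO looks like a file but funnels writes straight to whichever
# process is reading it -- exactly how Icinga consumes external commands.
mkfifo /tmp/demo.fifo
ls -l /tmp/demo.fifo | cut -c1    # first mode character is 'p' for a pipe
cat /tmp/demo.fifo &              # a reader, standing in for the Icinga daemon
echo "hello daemon" > /tmp/demo.fifo
wait
rm /tmp/demo.fifo
```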

The command file is owned by the icinga user, and its permissions allow only the icinga user or members of the icingacmd group to write to it. For this reason, you’ll need to update /etc/sudoers to permit your Capistrano user to write to the command file as the icinga user. Something like this, using the “tee” command:

/etc/sudoers

deploy ALL=(icinga) NOPASSWD: /usr/bin/tee -a /var/spool/icinga/cmd/icinga.cmd

Then, all that one needs to do is send a formatted string to the command pipe and Icinga will execute the instruction accordingly. The Icinga and Nagios documentation describes the order of each semicolon-separated field.
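
For example, a one-off fixed downtime window can be composed by hand. This sketch just builds and prints the command string – the host name, author, and comment are illustrative – with the final write to the pipe commented out, since that only works on the monitoring server itself:

```shell
# Build a SCHEDULE_HOST_SVC_DOWNTIME command: 10 minutes of fixed downtime
# for a hypothetical host called "backup", starting now.
NOW=$(date +%s)
STOP=$(expr $NOW + 600)
CMDLINE="[$NOW] SCHEDULE_HOST_SVC_DOWNTIME;backup;$NOW;$STOP;1;0;600;manual;testing"
echo "$CMDLINE"
# On the monitoring server, write it to the pipe as the icinga user:
#   echo "$CMDLINE" | sudo -u icinga /usr/bin/tee -a /var/spool/icinga/cmd/icinga.cmd
```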

Setting up a full Capistrano project is beyond the scope of this post, but assuming you have one, the files below will get things working.

This is a helper method that executes an Icinga command for a particular host, with the downtime period given in minutes.

capistrano/recipes/helpers/icinga_commands.rb:

def icinga_cli_cmd ( icingaCommand, hostName, minutes )
   run <<-CMD
      CMDFILE="/var/spool/icinga/cmd/icinga.cmd";
      MINUTES=`expr #{minutes} \\* 60`;
      NOW=`date +%s`;
      STOP=`expr $NOW + $MINUTES`;
      DURATION=$MINUTES;
      AUTHOR="Capistrano";
      COMMENT="Automation";
      CMDLINE="[$NOW] #{icingaCommand};#{hostName};$NOW;$STOP;1;0;$DURATION;$AUTHOR;$COMMENT";
      echo "$CMDLINE" | sudo -u icinga /usr/bin/tee -a $CMDFILE
   CMD
end

This is the Capistrano recipe to schedule downtime for a host and all its services.

capistrano/recipes/icinga.rb:

namespace "icinga" do
  task :config do
    close_sessions
    top.load(:string => "set :user, 'deploy'")
  end

  desc "Schedule Icinga Downtime of host and all services"
  task :downtime_host_svc, :roles => :monitor do
    config
    icinga_cli_cmd "SCHEDULE_HOST_SVC_DOWNTIME", "#{hostName}", "#{period}"
  end
end

And within the host definitions themselves, you'll need something like this:

capistrano/deploy/development.rb:

role :monitor, "monitor.example.com"

Your particular setup may involve some tweaking, and this is just an example that can easily be extended to better control Icinga. To invoke this Capistrano recipe from the command line, scheduling 7 minutes of downtime on the "backup" host, execute this command:

$ cap development icinga:downtime_host_svc -s hostName=backup -s period=7

This will send a string similar to the following to the Icinga daemon, and the command should also be reflected in the icinga.log file.

[1396341638] SCHEDULE_HOST_SVC_DOWNTIME;backup;1396341638;1396342058;1;0;420;Capistrano;Automation

Check the effect by viewing the log file and checking the host display in the Icinga web UI.

I use this recipe in a Jenkins job that restores a database snapshot. It means that while the database is being restarted I can suppress warnings. What's the point of being notified of what you already know?


Matt Parsons is a freelance Linux specialist who has designed, built and supported Unix and Linux systems in the finance, telecommunications and media industries.

He lives and works in London.

Aug 08 2012
 

There may be times when you need to forward alerts from one Nagios monitoring system to a second, possibly to a centralised security group who wants oversight over how you’re doing, or your first line support who needs first contact. Fortunately, the Nagios NSCA (Nagios Service Check Acceptor) can be used as a mechanism for doing just this.

I won’t go into too much detail about NSCA here – you’ll need to read the documentation, but I’ll offer a very brief primer.

NSCA

Unlike NRPE, where an agent is polled by the Nagios server, NSCA consists of an agent that sends alerts to the server in reaction to events. The NSCA server listens for these events and then passes them to the Nagios server. This is referred to as “passive” monitoring because Nagios is not actively checking.

The mechanism for this is that the NSCA client is a binary executable called send_nsca which connects to the NSCA daemon (nsca, listening on port 5667). Encryption is optional, and of varying security, and both agent and server have a configuration file for setting this. The client send_nsca program sends an alert string to the NSCA daemon which parses it and writes it to the Nagios spool file (a “named pipe” which Nagios watches constantly). As long as the Nagios server recognises the host and service that are specified in the alert (that is, they’re defined in its configuration), then the alert will be registered in Nagios and displayed in the GUI.

In order to use this process for alarm forwarding from a slave Nagios server to an upstream master Nagios, the idea is to use the NSCA client (send_nsca) as a notification command, and the upstream Nagios master as the contact.

The rest of this post will elaborate on these steps.

Configuring Alarm Forwarding

First, a high level description of the process of alarm forwarding using NSCA. In this example, the Nagios system that receives the forwarded alarm is called “OPSMON”.

  1. Alarm event occurs in Nagios due to multiple failures
  2. Notification is triggered
  3. “OPSMON” is one of the alarm contacts, and its service_notification_command is a send_nsca invocation.
  4. The NSCA client, using send_nsca, forwards a NSCA string containing the alarm information to the NSCA server, “OPSMON”.
  5. “OPSMON” is configured to receive alerts from the client host (on which the alarm was detected) and displays the alert accordingly.

In order to implement these operations, first NSCA must be installed and configured – the client and server. For brevity, I’ll assume that the NSCA server is already configured.

NSCA client

The NSCA client is installed on the Nagios host from which alerts will first be raised, and which will forward these alerts on to OPSMON.

Install the NSCA client:

  # yum -y install nsca-client  (CentOS/RedHat)
  # apt-get install nsca-client (Ubuntu)

This will create a configuration file called /etc/nagios/send_nsca.cfg. Ensure that this is readable by the nagios user, but preferably not by anyone else, since it will contain a password, if you’re using password encryption.

  # chown nagios:nagios /etc/nagios/send_nsca.cfg
  # chmod 600 /etc/nagios/send_nsca.cfg

You will need to set these values to match whatever is set in the /etc/nagios/nsca.cfg file on the OPSMON server:

  password=
  encryption_method=

Testing NSCA

This can be tested by sending a dummy alert. Check in the Nagios log file (or syslog) on OPSMON to see that the message arrives.

   printf "%s\t%s\t%s\t%s\n" "client-host" "Dummy Service" "2" "Run for it!" | \
       /usr/sbin/send_nsca -H opsmon-host.example.com -c /etc/nagios/send_nsca.cfg

The fields in the text string passed to send_nsca are listed in the configuration section below, where their meaning will become clearer.

Note that if this was a real alert, then the OPSMON Nagios would need its configuration to contain both a host defined with the name “client-host” and a service of “Dummy Service”. The text of these would need to match exactly.

Configure Nagios to Forward Alarms

Create the following Nagios objects in your configuration – either in one file, or individual files, depending on your setup.

Alert Forwarding Notification Commands

These commands, when executed by Nagios, will use the event environment data (macros) to send an NSCA message to the OPSMON server for service or host events. I like to use the tee command to also send the same text to a log file while debugging.

define command {
   command_name   notify-service-by-nsca
   command_line   /usr/bin/printf "%s\t%s\t%s\t%s\n" "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATEID$" "$SERVICEOUTPUT$|$SERVICEPERFDATA$" | tee -a /tmp/service_alert.log | /usr/sbin/send_nsca -H $CONTACTADDRESS1$ -c /etc/nagios/send_nsca.cfg
}

define command {
   command_name   notify-host-by-nsca
   command_line   /usr/bin/printf "%s\t%s\t%s\n" "$HOSTNAME$" "$HOSTSTATEID$" "$HOSTOUTPUT$" | /usr/sbin/send_nsca -H $CONTACTADDRESS1$ -c /etc/nagios/send_nsca.cfg 
}

The variables written like $HOSTNAME$ are Nagios macros; during any given event they contain values relating to the current host and service. Of particular note is $CONTACTADDRESS1$, which gets set in the contact definition, described below.
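
To make the macro expansion concrete, this is roughly what the notify-host-by-nsca command runs once Nagios has substituted the macros for a host event. The host name, state id, and plugin output are illustrative values:

```shell
# The expanded notification command: hostname, host state id and plugin
# output, tab-separated -- the format send_nsca expects on its stdin.
printf "%s\t%s\t%s\n" "client-host" "1" "CRITICAL - Host Unreachable"
# (on the real system this is then piped into send_nsca)
```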

Contact for Receiving Alerts

Because alerts are being forwarded by means of Notifications, a “Contact” needs to be created to which these notifications will be sent.

define contact {
   contact_name OPSMON
   service_notification_period 24x7
   host_notification_period 24x7
   service_notification_options w,u,c,r,f,s 
       ; all service states, flapping events and scheduled downtime events
   host_notification_options d,u,r,f,s 
       ; all host states, flapping events, and scheduled downtime events
   service_notification_commands notify-service-by-nsca
   host_notification_commands notify-host-by-nsca
   address1 opsmon-host.example.com
}

Use the Contact for Service Notifications

The contact OPSMON can now be referenced in service and host definitions. This means that any alert on the service will be sent to the OPSMON contact (the server opsmon-host.example.com), using the send_nsca command.

define service {
   host_name dummy-host
   use generic-service
   check_interval 1
   retry_interval 1
   max_check_attempts 2
   contacts OPSMON
}

Configure Nagios to Receive NSCA alerts

The upstream monitoring server (which I’ve been referring to as OPSMON) is configured to receive alerts passively – that is, it’s not actively polling its checks, it’s receiving events as they happen.

To configure a service as passive, set passive_checks_enabled to 1 and active_checks_enabled to 0 in the service definition or template.

Note that for any service to be monitored in the upstream server, the hostname and service description must match exactly what is being sent via the send_nsca command.
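
As a sketch, the matching definitions on the upstream server might look like this. The host and service names are the ones used in the dummy alert earlier, and the freshness settings are optional but useful for catching a client that has gone quiet:

```
define host {
   host_name client-host
   use generic-host
   address client-host.example.com
}

define service {
   host_name client-host
   service_description Dummy Service
   use generic-service
   active_checks_enabled 0
   passive_checks_enabled 1
   check_freshness 1           ; alert if no passive result arrives...
   freshness_threshold 3600    ; ...within an hour
}
```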

This post is a brief example, and is not meant to be exhaustive. It’s hoped that it will provide a starting point for increasing the functions of a Nagios setup.

More information on the Nagios configuration files and object definitions can be found at:
http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html



Jun 15 2012
 

Nagios and Icinga – which for all intents and purposes are the same thing – are wonderful tools. They hold to the Linux tenet of everything as a file, and therefore everything in Icinga is readable text. One slight flaw, however, is that when things do not behave as expected, they can be baffling to debug: the configuration files generalise command execution, but there is no way to verbosely log exactly what command gets run when a check executes – and therefore why it is failing. Fortunately, there is a wonderful script by Wolfgang Wagner called capture_plugin.pl that admirably solves this problem.

First, visit the website and download the script. Full instructions are provided there, but briefly, to use it, do the following.

Install the capture_plugin.pl script to your chosen directory for these kinds of things. The directory /usr/local/bin is suitable.

In the configuration file for the command you want to debug, insert a reference to the capture_plugin.pl Perl script at the beginning of the command_line entry:

command_line     /usr/local/bin/capture_plugin.pl   $USER1$/check_tcp .....

Essentially, what this does is to proxy the check command through the Perl script and capture output. Checking the Perl script source, the default log file is:

my $LOG_FILE = "/tmp/captured-plugins.log";

To complete the change, restart Icinga (or Nagios):

  # service icinga restart

Example Nagios Debugging

As an example, I’ve set up on my Icinga host (monhost) a JMX (for Tomcat) monitoring plugin called check_jmx4perl. This plugin runs on the Icinga host and periodically polls an agent webapp hosted on my Tomcat server. So it’s attempting to connect to a webapp here:
http://client2.example.com:8080/jolokia, but in the Icinga Web GUI, I’m only getting “UNKNOWN” from the plugin.

Checking the icinga logs in /var/log/messages:

Jun 19 00:00:00 monhost icinga: CURRENT SERVICE STATE: 
client2;JVM Thread Count;UNKNOWN;HARD;1;UNKNOWN - 
Error: 500 Error while fetching http://client2.example.com:8080/jolokia/read/java.lang%3Atype%3DThreading/ThreadCount :

Which, let’s face it, isn’t very helpful at all. It shows the plugin output, but nowhere does Icinga report the full command and arguments that were executed.

So this time, I’ll reconfigure the definition of the check_jmx command in the Nagios configuration file to run through capture_plugin.pl, like this:

define command {
 command_name  check_jmx
 command_line  /usr/local/bin/capture_plugin.pl \
    $USER1$/check_jmx4perl -u http://$HOSTADDRESS$:8080/jolokia -m $ARG1$ -a $ARG2$ -p $ARG3$ -w $ARG4$ -c $ARG5$
}

Restart Icinga and check the capture_plugin.pl log file, /tmp/captured-plugins.log, where I find this:

-------
 2012-5-21 16:3:31 ------ debugging
cmd=[/usr/lib64/nagios/plugins/check_jmx4perl '-u' 'http://client2.example.com:8080/jolokia' '-m' 'java.lang:type=Threading' '-a' 'ThreadCount' '-p' '' '-w' '70' '-c' '80']
output=[UNKNOWN - Error: 500 Error while fetching http://client2.example.com:8080/jolokia/read/java.lang%3Atype%3DThreading/ThreadCount :

500 Can't connect to client2.example.com:8080 (connect: timeout)
]
retcode=3
-------

So this time, I can actually see what parameters Icinga is passing to the plugin script itself. This means I can then run it myself from the command line, tweak the parameters, and work out what’s going wrong where.
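
As a small convenience, the cmd=[...] wrapper can be stripped off with sed to get a line that pastes straight into a shell. The sample entry here is the one from the log excerpt above:

```shell
# Strip the cmd=[...] wrapper from a captured-plugins.log entry to get a
# command line that can be re-run by hand.
SAMPLE="cmd=[/usr/lib64/nagios/plugins/check_jmx4perl '-u' 'http://client2.example.com:8080/jolokia' '-m' 'java.lang:type=Threading' '-a' 'ThreadCount' '-p' '' '-w' '70' '-c' '80']"
echo "$SAMPLE" | sed -e 's/^cmd=\[//' -e 's/\]$//'
```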

So this isn’t a complete answer to fixing problems with Icinga and Nagios, but capture_plugin.pl is a very useful tool which can rapidly speed up the debugging of monitoring issues.

