The out-of-the-box Nagios alerts displayed in Ambari Web cover a broad range of Hadoop behavior, but often you want to create additional alerts based on the needs of the individual installation. This section provides a high-level description of the process of adding those alerts so that they can be displayed in Ambari Web.
- Step 1: Create a Nagios Plugin Script/Executable
You must begin by creating a Nagios plugin that can check for the particular conditions that you wish to monitor. Many pre-written plugin scripts are available from the Open Source Nagios Plugin project and can be customized for your specific purposes. You can also look at the out-of-the-box plugin scripts that ship with Ambari. The default location for those files on the Nagios server is /usr/lib64/nagios/plugins/. For more information on creating Nagios plugins, see the Nagios Plugin project developer guidelines at http://nagiosplug.sourceforge.net/developer-guidelines.html.
- Step 2: Save Your Plugin to the Plugin Directory on the Nagios Server Machine
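As a point of reference, a plugin ready to be saved might look like the following minimal sketch. It is not one of the plugins that ship with Ambari. The file name `my_command_name.sh` matches the sample command used later in this section, while the `measure`, `evaluate`, and `main` functions and the disk-usage probe are assumptions made for illustration. The exit statuses follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Hypothetical plugin sketch: my_command_name.sh HOST WARN CRIT
# Nagios reads the first line of output and the exit status:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

# Assumed probe: percentage of the root filesystem in use on the machine
# running the plugin. Replace with whatever your alert needs to measure.
measure() {
    df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a whole-number value onto a Nagios state using two thresholds.
evaluate() {
    for n in "$1" "$2" "$3"; do
        case $n in
            ''|*[!0-9]*)
                echo "UNKNOWN - value and thresholds must be whole numbers"
                return 3 ;;
        esac
    done
    if [ "$1" -ge "$3" ]; then
        echo "CRITICAL - value is $1 (critical at $3)"; return 2
    elif [ "$1" -ge "$2" ]; then
        echo "WARNING - value is $1 (warning at $2)"; return 1
    fi
    echo "OK - value is $1"; return 0
}

main() {
    if [ "$#" -ne 3 ]; then
        echo "UNKNOWN - usage: my_command_name.sh HOST WARN CRIT"
        return 3
    fi
    # HOST is accepted but unused here because this probe is local.
    evaluate "$(measure)" "$2" "$3"
}

# Run only when executed under the plugin's own file name.
case ${0##*/} in
    my_command_name.sh) main "$@"; exit $? ;;
esac
```

Saved as `my_command_name.sh` and made executable (for example with `chmod +x`), the script can be run by hand with a host and two thresholds to confirm its output and exit status before Nagios calls it.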
The default location is /usr/lib64/nagios/plugins/.
- Step 3: Define the Command to Execute the New Plug-In
In /etc/nagios/objects, find and open the hadoop-commands.cfg file with a text editor. Add the following information to the list:
define command{
        command_name    my_command_name
        command_line    $USER1$/my_command_name.sh $HOSTADDRESS$ $ARG1$ $ARG2$
}
where:
Table 3.1. Define Command
| Variable Name | Variable Definition |
|---|---|
| command_name | The command name |
| command_line | The command line, with arguments, used to launch the plugin |
Notice that the command_line in the sample includes standard Nagios variables (macros) such as $ARG1$ and $HOSTADDRESS$. The variable $USER1$ is the Nagios plugin directory path. Write down the full command with its arguments for later use.
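To make the substitution concrete, the sketch below mimics by hand what Nagios does with the sample command_line. The host address and the two argument values are hypothetical, the `expanded_command` function is an illustrative helper, and `$USER1$` is assumed to point at the default plugin directory named earlier.

```shell
# Stand-ins for the Nagios macros. Every value except the plugin directory
# (the default location given above) is hypothetical.
USER1=/usr/lib64/nagios/plugins   # $USER1$: plugin directory (assumed default)
HOSTADDRESS=192.0.2.10            # $HOSTADDRESS$: address of the host being checked
ARG1=80                           # $ARG1$: first argument passed to the command
ARG2=90                           # $ARG2$: second argument passed to the command

# Print the command line as it would look after macro substitution.
expanded_command() {
    echo "$USER1/my_command_name.sh $HOSTADDRESS $ARG1 $ARG2"
}

expanded_command
```

Running the printed command by hand on the Nagios server, followed by `echo $?`, is a quick way to check both the plugin's output and its exit status before Nagios schedules it.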
- Step 4: Decide Which Hostgroup Your Plugin Should Check
In /etc/nagios/objects, find and open the hadoop-hostgroups.cfg file. Write down the hostgroup_name that corresponds to the set of hosts your check should run against.
- Step 5: Decide Which Servicegroup Your Plugin Belongs To
In /etc/nagios/objects, find and open the hadoop-servicegroups.cfg file. Write down the servicegroup_name that is most applicable, creating your own if necessary. Service groups let you enable or disable multiple alerts as a unit in the Nagios Web UI.
- Step 6: Define the Alert Entry
In /etc/nagios/objects, find and open the hadoop-services.cfg file. Create a service entry like the following and add it to the list:
define service {
        hostgroup_name          nagios-server
        use                     hadoop-service
        service_description     NAGIOS::Nagios status log staleness
        servicegroups           NAGIOS
        check_command           check_nagios!10!/var/nagios/status.dat!/usr/bin/nagios
        normal_check_interval   5
        retry_check_interval    0.5
        max_check_attempts      2
}
where:
Table 3.2. Define Service
| Variable Name | Variable Definition |
|---|---|
| hostgroup_name | The name you wrote down in Step 4 |
| use | Indicates that this service inherits from hadoop-service. All services inherit from hadoop-service. |
| service_description | The name of the service/alert[a] |
| servicegroups | The group name you wrote down in Step 5 |
| check_command | The command_name you defined in the hadoop-commands.cfg file in Step 3, followed by its arguments[b] |
| normal_check_interval | The number of minutes between regularly scheduled checks, as long as the state does not change |
| retry_check_interval | The number of minutes between “retries”[c] |
| max_check_attempts | The maximum number of retry attempts[d] |

[a] Follow the convention of using one of the predefined Hadoop service names as a prefix, followed by a double colon and then a short description of the new alert. The service name prefix determines the Service under which the alert appears. The list of predefined Hadoop service names includes NAMENODE, HDFS, JOBTRACKER, MAPREDUCE, HBASEMASTER, HBASE, ZOOKEEPER, HIVE-METASTORE, OOZIE, and TEMPLETON.
[b] In this format, arguments are separated by the “!” character and are passed to the command as $ARG1$, $ARG2$, and so on.
[c] When a service changes state, Nagios can confirm that state change by retrying the check multiple times. This retry interval can be different from the original check interval.
[d] Usually when the state of a service changes, the change is considered “soft” until multiple retries confirm it. After the state change is confirmed, it is considered “hard”. This value indicates the number of attempts that must be made to confirm the state as “hard” and thus to display it.
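Putting Steps 3 through 6 together for a custom alert, the following sketch prints a service entry that could be pasted into hadoop-services.cfg. The `service_entry` helper is hypothetical, and the hostgroup `namenode`, the servicegroup `HDFS`, the alert description, and the thresholds 80 and 90 are illustrative assumptions that should be replaced with the names you wrote down. The check_command value shows how `my_command_name!80!90` supplies `$ARG1$` and `$ARG2$` to the command defined in Step 3.

```shell
# Print a service stanza for hadoop-services.cfg.
# Arguments: hostgroup_name, service_description, servicegroups, check_command.
# The interval and retry values are copied from the sample entry above.
service_entry() {
    cat <<EOF
define service {
        hostgroup_name          $1
        use                     hadoop-service
        service_description     $2
        servicegroups           $3
        check_command           $4
        normal_check_interval   5
        retry_check_interval    0.5
        max_check_attempts      2
}
EOF
}

# Illustrative values only; substitute the names from Steps 4 and 5.
service_entry namenode "NAMENODE::Root disk usage" HDFS 'my_command_name!80!90'
```

Generating the stanza this way is optional. Typing it by hand works equally well, as long as the check_command uses the command_name from Step 3 and separates its arguments with “!”.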
- Step 7: Restart the Server to See the New Alerts
When you have finished making your edits, restart the Nagios service using the following command as the root user:
service nagios restart
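Because a syntax error in any of the edited files can keep Nagios from starting, it may be worth checking the configuration before the restart. The sketch below wraps the two steps in a hypothetical `verify_then_restart` function. It assumes the Nagios binary is at /usr/bin/nagios (the path that appears in the sample check_command) and that the main configuration file is /etc/nagios/nagios.cfg, which may differ on your installation.

```shell
# Assumed locations; override them in the environment if yours differ.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}

# Restart Nagios only if its built-in configuration check (-v) passes.
verify_then_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        service nagios restart
    else
        echo "Configuration check failed; Nagios was not restarted." >&2
        return 1
    fi
}
```

After loading these definitions in a root shell, run `verify_then_restart`. If the check reports errors, fix the file it names and run the function again.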

