The out-of-the-box Nagios alerts displayed in Ambari Web cover a broad range of Hadoop behavior, but often an administrator will want to create additional alerts based on the needs of the individual installation. This section provides a high-level description of the process of adding those alerts so that they can be displayed in Ambari Web.
- Step 1: Create a Nagios Plugin Script/Executable
- Begin by creating a Nagios plugin that checks for the conditions you want to monitor. Many pre-written plugin scripts are available from the open source Nagios Plugins project and can be customized for your purposes. You can also look at the out-of-the-box plugin scripts that ship with Ambari. The default location for those files on the Nagios server is /usr/lib64/nagios/plugins/. For more information on creating Nagios plugins, see the Nagios Plugin project page at http://nagiosplug.sourceforge.net/developer-guidelines.html.
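As a rough illustration of what such a plugin looks like, the following sketch shows the core of a threshold check written in shell. The function name check_value and the way the measured value is obtained are assumptions made for this example and are not part of Ambari. The exit codes follow the standard Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```shell
#!/bin/sh
# Hypothetical core of a plugin such as my_command_name.sh.
# check_value VALUE WARN CRIT
#   Prints one status line and returns the matching Nagios exit code:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
check_value() {
  value=$1
  warn=$2
  crit=$3

  # Reject missing or non-numeric arguments with UNKNOWN.
  for n in "$value" "$warn" "$crit"; do
    case "$n" in
      ''|*[!0-9]*)
        echo "UNKNOWN - arguments must be non-negative integers"
        return 3
        ;;
    esac
  done

  if [ "$value" -ge "$crit" ]; then
    echo "CRITICAL - value=$value"
    return 2
  elif [ "$value" -ge "$warn" ]; then
    echo "WARNING - value=$value"
    return 1
  fi

  echo "OK - value=$value"
  return 0
}
```

A complete plugin would obtain the value from the host being checked, call check_value with it, and exit with the function's return code so that Nagios can read the status.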
- Step 2: Save Your Plugin to the Plugin Directory on the Nagios Server Machine
- The default location is /usr/lib64/nagios/plugins/.
- Step 3: Define the Command to Execute the New Plugin
- In /etc/nagios/objects, find and open the hadoop-commands.cfg file with a text editor. Add the following information to the list:
define command{
  command_name  my_command_name
  command_line  $USER1$/my_command_name.sh $HOSTADDRESS$ $ARG1$ $ARG2$
}
where:
- command_name is the command name.
- command_line is the command with arguments used to launch the command.
Notice that the command_line in the sample includes standard Nagios macros such as $ARG1$ and $HOSTADDRESS$. The macro $USER1$ is the Nagios plugin directory path. Write the full command with arguments down for later use.
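To see what the sample command_line looks like once the macros are filled in, the following sketch substitutes illustrative values by hand. The host address and the two argument values are placeholders chosen for this example. The plugin directory is the default location mentioned in Step 2.

```shell
#!/bin/sh
# Illustrative stand-ins for the Nagios macros in the sample command_line.
USER1=/usr/lib64/nagios/plugins   # $USER1$: plugin directory (default location)
HOSTADDRESS=192.0.2.10            # $HOSTADDRESS$: placeholder address
ARG1=10                           # $ARG1$: placeholder first argument
ARG2=20                           # $ARG2$: placeholder second argument

# The command line as it would read after substitution.
cmd="$USER1/my_command_name.sh $HOSTADDRESS $ARG1 $ARG2"
echo "$cmd"
```

Running the resulting command by hand on the Nagios server and checking its exit status is a quick way to confirm that the plugin behaves as expected before Nagios schedules it.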
- Step 4: Decide Which Hostgroup Your Plugin Should Check
- In /etc/nagios/objects, find and open the hadoop-hostgroups.cfg file. Write down the hostgroup_name that corresponds to the set of hosts your check should run against.
- Step 5: Decide Which Servicegroup Your Plugin Belongs To
- In /etc/nagios/objects, find and open the hadoop-servicegroups.cfg file. Write down the servicegroup_name that is most applicable, creating your own if necessary. Service groups are useful for enabling or disabling multiple alerts as a unit in the Nagios Web UI.
- Step 6: Define the Alert Entry
- In /etc/nagios/objects, find and open the hadoop-services.cfg file. Create a service entry like the following and add it to the list:
define service {
  hostgroup_name        nagios-server
  use                   hadoop-service
  service_description   NAGIOS::Nagios status log staleness
  servicegroups         NAGIOS
  check_command         check_nagios!10!/var/nagios/status.dat!/usr/bin/nagios
  normal_check_interval 5
  retry_check_interval  0.5
  max_check_attempts    2
}
where:
- hostgroup_name is the name you wrote down in Step 4.
- use indicates that this service inherits from hadoop-service. All services inherit from hadoop-service.
- service_description is the name of the service/alert.
Follow the convention of using one of the predefined Hadoop service names as a prefix, followed by a double colon and then a short description of the new alert. The service name prefix determines under which Service the alert appears. The predefined Hadoop service names include NAMENODE, HDFS, JOBTRACKER, MAPREDUCE, HBASEMASTER, HBASE, ZOOKEEPER, HIVE-METASTORE, OOZIE, and TEMPLETON.
- servicegroups is the group name you wrote down in Step 5.
- check_command is the command_name you defined in the hadoop-commands.cfg file in Step 3, followed by the arguments to pass to it.
Note that in this format, the command name and its arguments are separated by the “!” character.
- normal_check_interval is the number of minutes between regularly scheduled checks on the host, as long as the check does not change the state.
- retry_check_interval is the number of minutes between “retries”.
When a service changes state, Nagios can confirm that state change by retrying the check multiple times. This retry interval can be different from the original check interval.
- max_check_attempts is the maximum number of retry attempts.
Usually when the state of a service changes, the change is considered “soft” until multiple retries confirm it. Once the state change is confirmed, it is considered “hard”. This value indicates the number of attempts that must be made to confirm the state as “hard” and thus to display it.
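The “!”-separated check_command format can be easier to read once it is split into its parts. The following sketch takes the check_command value from the sample entry and prints the command name followed by the values that Nagios would make available as $ARG1$, $ARG2$, and so on. The helper name split_check_command is an assumption made for this example.

```shell
#!/bin/sh
# check_command value taken from the sample service entry.
check_command='check_nagios!10!/var/nagios/status.dat!/usr/bin/nagios'

# Hypothetical helper: split a check_command string on "!" and print
# the command name and each numbered argument on its own line.
split_check_command() {
  old_ifs=$IFS
  IFS='!'
  set -- $1        # unquoted on purpose: split the string on "!"
  IFS=$old_ifs

  echo "command_name=$1"
  shift

  i=1
  for arg in "$@"; do
    echo "ARG$i=$arg"
    i=$((i + 1))
  done
}

split_check_command "$check_command"
```

For the sample entry this lists check_nagios as the command name, with 10, /var/nagios/status.dat, and /usr/bin/nagios as the first, second, and third arguments.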
- Step 7: Restart the Server to See the New Alerts
- When you have finished making your edits, restart the Nagios service using the following command as the root user:
service nagios restart
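Because a syntax error in any of the edited files can prevent Nagios from starting, it may be worth verifying the configuration before restarting. The following sketch wraps the restart in a check that uses the nagios -v verification option. The main configuration file path /etc/nagios/nagios.cfg and the function name restart_if_valid are assumptions made for this example, so adjust them for your installation.

```shell
#!/bin/sh
# Assumed location of the main Nagios configuration file.
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}

# Hypothetical helper: restart Nagios only if the configuration verifies.
restart_if_valid() {
  cfg=$1

  if [ ! -r "$cfg" ]; then
    echo "cannot read $cfg" >&2
    return 1
  fi

  # nagios -v verifies the configuration without starting the daemon.
  if nagios -v "$cfg" >/dev/null; then
    service nagios restart
  else
    echo "configuration check failed; run 'nagios -v $cfg' for details" >&2
    return 1
  fi
}

# Run as root:
# restart_if_valid "$NAGIOS_CFG"
```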


