YARN alerts
Descriptions, potential causes and possible remedies for alerts triggered by YARN.
| Alert | Alert Type | Description | Potential Causes | Possible Remedies | 
|---|---|---|---|---|
| App Timeline Web UI | WEB | This host-level alert is triggered if the App Timeline Server Web UI is unreachable. | The App Timeline Server is down. The App Timeline Server is running but is not listening on the correct network port/address. | Check in Ambari Web for an App Timeline Server that is not running. | 
| Percent NodeManagers Available | AGGREGATE | This alert is triggered if the number of down NodeManagers in the cluster is greater than the configured critical threshold. It aggregates the results of NodeManager process alert checks. | NodeManagers are down. NodeManagers are running but are not listening on the correct network port/address. | Check for non-operating NodeManagers. Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager hosts/processes, as necessary. Run the `netstat -tuplpn` command to check if the NodeManager process is bound to the correct network port. | 
| ResourceManager Web UI | WEB | This host-level alert is triggered if the ResourceManager Web UI is unreachable. | The ResourceManager process is not running. | Check if the ResourceManager process is running. | 
| ResourceManager RPC Latency | METRIC | This host-level alert is triggered if the ResourceManager operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for ResourceManager operations. | A job or an application is performing too many ResourceManager operations. | Review the job or the application for potential bugs causing it to perform too many ResourceManager operations. | 
| ResourceManager CPU Utilization | METRIC | This host-level alert is triggered if CPU utilization of the ResourceManager exceeds certain thresholds (200% warning, 250% critical). It checks the ResourceManager JMX Servlet for the SystemCPULoad property. This information is only available if you are running JDK 1.7. | Unusually high CPU utilization. This can be caused by a very unusual job/query workload, but it is generally a sign of an issue in the daemon. | Use the `top` command to determine which processes are consuming excess CPU. Restart the offending process. | 
| NodeManager Web UI | WEB | This host-level alert is triggered if the NodeManager process cannot be confirmed to be up and listening on the network within the configured critical threshold, given in seconds. | The NodeManager process is down or not responding. The NodeManager is running but is not listening on the correct network port/address. | Check if the NodeManager is running. Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager, if necessary. | 
| NodeManager Health Summary | SCRIPT | This host-level alert checks the node health property available from the NodeManager component. | The NodeManager Health Check script reports issues or is not configured. | Check the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager, if necessary. Check the ResourceManager UI logs (/var/log/hadoop/yarn) for health check errors. | 
| NodeManager Health | SCRIPT | This host-level alert checks the nodeHealthy property available from the NodeManager component. | The NodeManager process is down or not responding. | Check the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager, if necessary. | 
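
Several remedies in the table come down to confirming that a YARN process is listening on its expected port, and the Percent NodeManagers Available alert compares a share of down NodeManagers against a threshold. The following Python sketch illustrates both ideas outside of Ambari. It is not Ambari's implementation: the function names `is_port_open` and `classify_percent_down`, the host name, and the threshold values are hypothetical and chosen only for illustration.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def classify_percent_down(down: int, total: int,
                          warning: float, critical: float) -> str:
    """Classify the percentage of down NodeManagers against two thresholds.

    `warning` and `critical` are percentages; the values passed by callers
    here are illustrative, not Ambari defaults.
    """
    if total <= 0:
        return "UNKNOWN"
    percent = 100.0 * down / total
    if percent > critical:
        return "CRITICAL"
    if percent > warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical host; 8088 is the stock default port for the
    # ResourceManager web UI, but your cluster may be configured differently.
    host, port = "rm.example.com", 8088
    print(f"{host}:{port} reachable:", is_port_open(host, port, timeout=3.0))

    # Example: 4 of 10 NodeManagers down, with illustrative thresholds.
    print(classify_percent_down(down=4, total=10, warning=10, critical=30))
```

A successful TCP connection only shows that something is listening on the port. It does not confirm that the listener is the expected YARN process, so `netstat -tuplpn` on the host remains the way to verify which process is bound to it.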

