Configuring NodeManagers for Work-Preserving Restart
NodeManager work-preserving restart enables a NodeManager to be restarted without losing the active containers running on the node. At a high level, the NodeManager stores any necessary state to a local state store as it processes container management requests. When the NodeManager restarts, it recovers by first loading the state for its various subsystems, and then lets those subsystems perform recovery using the loaded state.
To configure work-preserving restart for NodeManagers, set the following properties in the yarn-site.xml file on all NodeManagers in the cluster.
Property:
yarn.nodemanager.recovery.enabled
Value:
true
Description:
Enables the NodeManager to recover after a restart.
Example:
<property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
 </property>
Property:
yarn.nodemanager.recovery.dir
Value:
<yarn_log_dir_prefix>/nodemanager/recovery-state
Description:
The local file system directory in which the NodeManager will store state information when recovery is enabled.
Example:
<property>
    <name>yarn.nodemanager.recovery.dir</name>
    <value><yarn_log_dir_prefix>/nodemanager/recovery-state</value>
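 </property>
You should also confirm that the port in yarn.nodemanager.address is set to a non-zero value. With port 0, the NodeManager binds to an ephemeral port that can differ before and after a restart, which breaks clients that were communicating with the NodeManager before it restarted. For example, to use port 45454: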
<property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
 </property>
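The three settings above can be checked together before restarting a NodeManager. The following is a minimal sketch, not part of YARN or any vendor tooling: the function names load_properties and check_nm_recovery are hypothetical, and the recovery directory in SAMPLE is an assumed illustrative path standing in for <yarn_log_dir_prefix>. It parses Hadoop-style configuration XML with the Python standard library and reports any setting that does not match the guidance in this section.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style <configuration> XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_nm_recovery(props):
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    if props.get("yarn.nodemanager.recovery.enabled", "").lower() != "true":
        problems.append("yarn.nodemanager.recovery.enabled is not true")
    if not props.get("yarn.nodemanager.recovery.dir"):
        problems.append("yarn.nodemanager.recovery.dir is not set")
    # The port is whatever follows the last ':' in host:port.
    port = props.get("yarn.nodemanager.address", "").rpartition(":")[2]
    if not port.isdigit() or int(port) == 0:
        problems.append("yarn.nodemanager.address does not specify a non-zero port")
    return problems


# Sample configuration; the recovery directory is an assumed example path.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.recovery.dir</name>
    <value>/var/log/hadoop-yarn/nodemanager/recovery-state</value>
  </property>
  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for line in check_nm_recovery(load_properties(SAMPLE)) or ["OK"]:
        print(line)
```

To check a real node, the contents of its yarn-site.xml could be read from disk and passed to load_properties in place of SAMPLE. The check only inspects values written in that file, so it does not account for defaults or overrides supplied elsewhere.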
