We have added the following settings to the yarn-site.xml file of our Hadoop YARN cluster:
<property>
<name>yarn.nodemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.recovery.supervised</name>
<value>true</value>
</property>
What is the proper way to decommission a NodeManager (NM) node that has the NM restart recovery feature enabled?
The NM restart recovery feature has been working well: applications do not fail even when we restart the NodeManager processes. However, when we try to decommission a node by adding its hostname to the yarn_exclude_hosts file and refreshing the nodes on the ResourceManager, the applications that had containers running on that node are stuck for a long time and then fail.
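For reference, the decommission steps described above are roughly equivalent to the sketch below. The helper name `decommission_node` and its arguments are placeholders for illustration only, not our exact script. The sketch assumes the exclude file is the one the ResourceManager reads via `yarn.resourcemanager.nodes.exclude-path`.

```shell
# Hypothetical helper mirroring the two steps described above.
decommission_node() {
    node="$1"          # hostname as registered with the ResourceManager
    exclude_file="$2"  # assumed to be the file set in yarn.resourcemanager.nodes.exclude-path

    # Append the host only if it is not already listed (exact whole-line match).
    grep -qxF "$node" "$exclude_file" 2>/dev/null || printf '%s\n' "$node" >> "$exclude_file"

    # Ask the ResourceManager to re-read its include/exclude host lists.
    yarn rmadmin -refreshNodes
}
```

It would be called on the ResourceManager host as `decommission_node <nm-hostname> <path-to-yarn_exclude_hosts>`.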