I have a Spark EMR cluster with 1 master node and 8 Spot nodes. Today all the nodes died while running a job, and spark-shell is not accessible afterwards either.
Clicking 'Unhealthy Nodes' in the Hadoop console shows this error:

2/4 local-dirs are bad: /mnt/yarn,/mnt3/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
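Before changing any config, a quick sanity check from the master should show which nodes YARN flags and whether the mounts named in the error are actually full (the core-node hostname below is a placeholder):

# list the nodes YARN currently marks unhealthy, with their health report
yarn node -list -states UNHEALTHY
# on an affected node, check free space on the mounts named in the error
ssh hadoop@ip-10-0-0-1.ec2.internal 'df -h /mnt /mnt3 /var/log'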
It seems related to the disk space issue in Why does Hadoop report "Unhealthy Node local-dirs and log-dirs are bad"?, so I modified yarn-site.xml as described there:
<property>
  <name>yarn.nodemanager.disk-health-checker.enable</name>
  <value>false</value>
</property>
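For reference, the file I edited is /etc/hadoop/conf/yarn-site.xml on the master. Since the disk health checker runs inside each NodeManager, I assume the same change also has to land on every core node, roughly like this (hostnames are placeholders):

# copy the edited config to each core node;
# /etc/hadoop/conf is root-owned on EMR, hence the sudo mv
for host in ip-10-0-0-1 ip-10-0-0-2; do
  scp /etc/hadoop/conf/yarn-site.xml hadoop@"$host":/tmp/yarn-site.xml
  ssh hadoop@"$host" 'sudo mv /tmp/yarn-site.xml /etc/hadoop/conf/yarn-site.xml'
done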
Then I restarted the related services as described in How to restart Spark service in EMR after changing conf settings?, but the nodes did not come back alive. These are the commands I ran:
sudo stop hadoop-yarn-resourcemanager
sudo start hadoop-yarn-resourcemanager
sudo stop spark-history-server
sudo start spark-history-server
sudo status hadoop-yarn-resourcemanager
sudo status spark-history-server
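These all run on the master, though. Since it is the NodeManagers on the core nodes that report the bad dirs, I assume each core node also needs its NodeManager restarted so it re-reads yarn-site.xml (same upstart-style commands, run per core node):

# on each core node: restart the NodeManager so it picks up the new config
sudo stop hadoop-yarn-nodemanager
sudo start hadoop-yarn-nodemanager
sudo status hadoop-yarn-nodemanager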