I run several streaming Spark jobs and batch Spark jobs on the same EMR cluster. Recently, one of the batch Spark jobs had a bug and consumed a lot of memory. This made the master node unresponsive and left all the other Spark jobs stuck, so the whole EMR cluster was effectively down.
Is there a way to restrict the maximum memory a Spark job can consume? If a Spark job uses too much memory, it is fine for that job to fail; we just don't want it to take down the whole EMR cluster.
The Spark jobs run in client mode and are submitted with a spark-submit command like the one below:
spark-submit --driver-memory 2G --num-executors 1 --executor-memory 2G --executor-cores 1 --class test.class s3://test-repo/mysparkjob.jar
The relevant yarn-site classification from the cluster's EMR configuration is:

{
    'Classification': 'yarn-site',
    'Properties': {
        'yarn.nodemanager.disk-health-checker.enable': 'true',
        'yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage': '95.0',
        'yarn.nodemanager.localizer.cache.cleanup.interval-ms': '100000',
        'yarn.nodemanager.localizer.cache.target-size-mb': '1024',
        'yarn.nodemanager.pmem-check-enabled': 'false',
        'yarn.nodemanager.vmem-check-enabled': 'false',
        'yarn.log-aggregation.retain-seconds': '12000',
        'yarn.log-aggregation-enable': 'true',
        'yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds': '3600',
        'yarn.resourcemanager.scheduler.class': 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler'
    }
}
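
For context, this classification is applied when the cluster is created. Below is a rough sketch of how it might be passed through boto3's run_job_flow; the region, cluster name, release label, and instance setup are placeholder assumptions, since only the yarn-site block is shown above.

import boto3

# Sketch only: every value marked "placeholder" is an assumption,
# not a setting taken from the actual cluster.
emr = boto3.client('emr', region_name='us-east-1')  # placeholder region

configurations = [
    {
        'Classification': 'yarn-site',
        'Properties': {
            # ... the yarn-site properties listed above ...
            'yarn.nodemanager.pmem-check-enabled': 'false',
            'yarn.nodemanager.vmem-check-enabled': 'false',
        },
    },
]

# The classification list goes into the Configurations argument:
# emr.run_job_flow(
#     Name='my-emr-cluster',              # placeholder
#     ReleaseLabel='emr-6.9.0',           # placeholder
#     Instances={'InstanceGroups': []},   # placeholder
#     Configurations=configurations,
#     JobFlowRole='EMR_EC2_DefaultRole',
#     ServiceRole='EMR_DefaultRole',
# )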
Thanks!