I've got an EMR cluster that I'm using to run large text-processing jobs. The jobs ran fine on a smaller cluster, but after resizing, the master keeps executing them locally and crashing due to memory issues.
This is the current configuration I have for my cluster:
[
  {
    "classification": "capacity-scheduler",
    "properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    },
    "configurations": []
  },
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    },
    "configurations": []
  },
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.executor.instances": "0",
      "spark.dynamicAllocation.enabled": "true"
    },
    "configurations": []
  }
]
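In case it helps, this is roughly how the generated config can be checked on the master after the resize; these are the standard EMR config locations, as far as I know:

# On the master node: confirm spark-defaults still targets YARN and kept the dynamic-allocation settings
grep -E 'spark\.master|dynamicAllocation|executor\.instances' /etc/spark/conf/spark-defaults.conf

# Confirm YARN itself still points at the ResourceManager on the master
grep -A 1 'yarn.resourcemanager.hostname' /etc/hadoop/conf/yarn-site.xml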
I took this configuration from a suggested solution in this question, and it did work before the resize.
Now, whenever I submit a Spark job with spark-submit mytask.py,
I see tons of log entries indicating the work never leaves the master host, like this one:
17/08/14 23:49:23 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0,localhost, executor driver, partition 0, PROCESS_LOCAL, 405141 bytes)
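The "localhost, executor driver" part of those entries is what makes me think the job is running in local mode rather than on YARN. While a job is running, the YARN side can be watched with the standard CLI (nothing custom here) to see whether the application ever registers and whether the core nodes get any containers:

# List the applications YARN knows about (a local-mode job won't show up here)
yarn application -list

# List the NodeManagers (the core nodes) and how many containers each is running
yarn node -list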
I've tried different parameters, such as --deploy-mode cluster and --master yarn (since YARN is running on the master node), but all the work is still being done by the master host while the core nodes sit idle.
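For completeness, the fullest variant of the submit command I've tried looks like this (still no luck):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  mytask.py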
Is there another configuration I'm missing, preferably one that doesn't require rebuilding the cluster?