I ran a Spark cluster of 12 nodes (8 GB memory and 8 cores each) for some tests.
I'm trying to figure out why the data locality of every task in the "map" stage of a simple wordcount app is "Any". The 14 GB dataset is stored in HDFS.
I ran into the same problem, and in my case it was a configuration issue. I was running on EC2 and had a hostname mismatch. Maybe the same thing happened to you.
When you check how HDFS sees your cluster, it should look something along these lines:
hdfs dfsadmin -printTopology
Rack: /default-rack
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
And the same addresses should appear in the executors' address column in the Spark UI (by default it's http://your-cluster-public-dns:8080/).
In my case I was using the public hostnames for the Spark slaves. I changed SPARK_LOCAL_IP in $SPARK/conf/spark-env.sh to use the private name as well, and after that change I get NODE_LOCAL most of the time.
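A minimal sketch of that change, assuming the EC2 private hostname is the one HDFS reports in -printTopology (the value below is a placeholder; use whatever the private name resolves to on each slave):

# $SPARK/conf/spark-env.sh on each slave
# Bind Spark to the private address so it matches the datanode's name in HDFS.
export SPARK_LOCAL_IP=ip-172-31-xx-xxx.eu-central-1.compute.internal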
I encountered the same problem today. This is my situation:
My cluster has 9 workers (each running one executor by default). When I set --total-executor-cores 9, the locality level is NODE_LOCAL, but when I set total-executor-cores below 9, such as --total-executor-cores 7, the locality level becomes ANY and the total time cost is 10x that of NODE_LOCAL. You can give it a try.
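For reference, a minimal sketch of the submit command that gave me NODE_LOCAL (the master URL, class name, jar, and input path are placeholders for my actual job):

spark-submit --master spark://<master>:7077 \
    --class WordCount \
    --total-executor-cores 9 \
    wordcount.jar hdfs:///path/to/input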
I'm running my cluster on EC2, and I fixed my problem by adding the following to spark-env.sh on the name node:
SPARK_MASTER_HOST=<name node hostname>
and then adding the following to spark-env.sh on the data nodes:
SPARK_LOCAL_HOSTNAME=<data node hostname>
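If you'd rather not hard-code the name on every data node, here's a sketch of an equivalent line, assuming hostname returns the same private name that HDFS reports (an assumption worth verifying with hdfs dfsadmin -printTopology):

# spark-env.sh on each data node
# Hypothetical shortcut: derive the hostname instead of hard-coding it.
SPARK_LOCAL_HOSTNAME=$(hostname)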
Don't start the slaves with start-all.sh; you should start each slave individually:
$SPARK_HOME/sbin/start-slave.sh -h <hostname> <masterURI>
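A sketch of scripting that from the master, assuming passwordless SSH, an identical $SPARK_HOME path on every worker, and a hypothetical slaves.txt listing one worker hostname per line (the master URL is a placeholder):

# Start one worker per host, binding each to its own hostname.
while read host; do
    ssh "$host" "$SPARK_HOME/sbin/start-slave.sh -h $host spark://<master>:7077"
done < slaves.txt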