I ran a Spark cluster of 12 nodes (8 GB memory and 8 cores each) for some tests.
I'm trying to figure out why the data locality of every task in the "map" stage of a simple wordcount app is "Any". The 14 GB dataset is stored in HDFS.
I ran into the same problem, and in my case it was a configuration issue. I was running on EC2 and had a hostname mismatch. Maybe the same thing happened to you.
When you check how HDFS sees your cluster, it should look something along these lines:
hdfs dfsadmin -printTopology
Rack: /default-rack
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
And the same addresses should appear in the executors' address column in the Spark UI (by default it's http://your-cluster-public-dns:8080/).
In my case I was using the public hostnames for the Spark slaves. I changed SPARK_LOCAL_IP in $SPARK/conf/spark-env.sh to use the private name as well, and after that change I get NODE_LOCAL most of the time.
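A minimal sketch of that change, assuming the EC2 private hostname is the one HDFS reports in -printTopology (the value below is a placeholder; use whatever the private name resolves to on each slave):

# $SPARK/conf/spark-env.sh on each slave
# Bind Spark to the private address so it matches the datanode's name in HDFS.
export SPARK_LOCAL_IP=ip-172-31-xx-xxx.eu-central-1.compute.internal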
I encountered the same problem today. This is my situation:
My cluster has 9 workers (each running one executor by default). When I set --total-executor-cores 9, the locality level is NODE_LOCAL, but when I set total-executor-cores below 9, such as --total-executor-cores 7, the locality level becomes ANY and the total time cost is 10x that of NODE_LOCAL. You can give it a try.
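For reference, a minimal sketch of the submit command that gave me NODE_LOCAL (the master URL, class name, jar, and input path are placeholders for my actual job):

spark-submit --master spark://<master>:7077 \
    --class WordCount \
    --total-executor-cores 9 \
    wordcount.jar hdfs:///path/to/input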
I'm running my cluster on EC2, and I fixed my problem by adding the following to spark-env.sh on the name node:
SPARK_MASTER_HOST=<name node hostname>
and then adding the following to spark-env.sh on the data nodes:
SPARK_LOCAL_HOSTNAME=<data node hostname>
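If you'd rather not hard-code the name on every data node, here's a sketch of an equivalent line, assuming hostname returns the same private name that HDFS reports (an assumption worth verifying with hdfs dfsadmin -printTopology):

# spark-env.sh on each data node
# Hypothetical shortcut: derive the hostname instead of hard-coding it.
SPARK_LOCAL_HOSTNAME=$(hostname)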
Don't start the slaves with start-all.sh; you should start each slave individually:
$SPARK_HOME/sbin/start-slave.sh -h <hostname> <masterURI>
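A sketch of scripting that from the master, assuming passwordless SSH, an identical $SPARK_HOME path on every worker, and a hypothetical slaves.txt listing one worker hostname per line (the master URL is a placeholder):

# Start one worker per host, binding each to its own hostname.
while read host; do
    ssh "$host" "$SPARK_HOME/sbin/start-slave.sh -h $host spark://<master>:7077"
done < slaves.txt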