3

I'm running a Spark cluster of 12 nodes (8 GB memory and 8 cores each) for some tests.

I'm trying to figure out why the data locality of a simple wordcount app's "map" stage is all "Any". The 14 GB dataset is stored in HDFS.

(screenshots omitted)

Xingjun Wang

4 Answers

1

I have run into the same problem, and in my case it was a configuration problem. I was running on EC2 and had a hostname mismatch. Maybe the same thing happened to you.

When you check how HDFS sees your cluster, it should look something like this:

hdfs dfsadmin -printTopology
Rack: /default-rack
   172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
   172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)

The same addresses should appear for the executors in the Spark UI (by default it's http://your-cluster-public-dns:8080/).

In my case I was using the public hostnames for the Spark slaves. I changed SPARK_LOCAL_IP in $SPARK/conf/spark-env.sh to use the private name as well, and after that change I get NODE_LOCAL most of the time.
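
For reference, a minimal sketch of that change, assuming a standalone cluster on EC2; the hostname below is only an illustration, use your own node's private name or IP:

# $SPARK/conf/spark-env.sh on each worker:
# bind Spark to the same private address that HDFS reports in -printTopology
SPARK_LOCAL_IP=ip-172-31-xx-xxx.eu-central-1.compute.internal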

lpiepiora
  • I'm sorry, I don't understand your answer clearly. What exactly should I modify? Can you please give an example? Thanks. – Xingjun Wang Dec 07 '15 at 01:36
  • Can you add output of `hdfs dfsadmin -printTopology`? – lpiepiora Dec 07 '15 at 04:49
  • OK, I posted it in the question. – Xingjun Wang Dec 08 '15 at 01:30
  • @XingjunWang and the overview page of your cluster? Basically, in my case, the name you have in parentheses (vm1-vm16 in your case) should be the same as the one Spark sees. So you should check your hostname, the `conf/spark-env.sh` config, etc. – lpiepiora Dec 08 '15 at 08:52
1

I encountered the same problem today. This is my situation:

My cluster has 9 workers (each runs one executor by default). When I set --total-executor-cores 9, the locality level is NODE_LOCAL, but when I set --total-executor-cores below 9, for example --total-executor-cores 7, the locality level becomes ANY and the total time cost is about 10x that of the NODE_LOCAL run. You can give it a try.
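
For illustration only, a sketch of how that flag might be passed at submission time; the master URL, class name, jar, and input path are placeholders, not from the original post:

spark-submit --master spark://master:7077 \
  --total-executor-cores 9 \
  --class com.example.WordCount \
  wordcount.jar hdfs:///path/to/input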

xu Bruce
  • Well, I found another way to deal with it: you can try increasing the Spark conf spark.locality.wait to a larger value such as 50, or increasing spark.locality.wait.process to 30; then you may get your result (a sketch follows below). – xu Bruce May 05 '16 at 09:44
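
A rough sketch of how those locality-wait settings could be passed, assuming spark-submit; the values mirror the comment above and are illustrative, not tuned recommendations:

spark-submit \
  --conf spark.locality.wait=50s \
  --conf spark.locality.wait.process=30s \
  ...   # rest of the submit command unchanged
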
0

I'm running my cluster on EC2 instances, and I fixed my problem by adding the following to spark-env.sh on the name node:

SPARK_MASTER_HOST=<name node hostname>

and then adding the following to spark-env.sh on the data nodes:

SPARK_LOCAL_HOSTNAME=<data node hostname>
supamanda
0

Don't start the slaves with start-all.sh; you should start every slave individually:

$SPARK_HOME/sbin/start-slave.sh -h <hostname> <masterURI> 
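
For example (a hedged sketch; the hostname expansion and master URI below are illustrative):

# run on each worker, binding it to its own fully qualified hostname
$SPARK_HOME/sbin/start-slave.sh -h $(hostname -f) spark://vm1:7077
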
SmallWong