I am trying to understand locality levels on a Spark cluster and their relationship with the RDD's number of partitions and the action performed on it. Specifically, I have a dataframe with 9647 partitions. I then ran df.count on it and observed the following in the Spark UI:
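For completeness, this is roughly what I am doing (the input path and format are placeholders, not my real source):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; my actual data source differs, but the shape is the same
df = spark.read.parquet("hdfs:///path/to/data")

print(df.rdd.getNumPartitions())  # 9647 in my case
df.count()                        # the action whose tasks show up in the Spark UI
```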
A bit of context: I submitted my job to a YARN cluster with the following configuration:
- executor_memory='10g',
- driver_memory='10g',
- num_executors='5',
- executor_cores='5'
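If it helps, I believe this corresponds to the following standard Spark settings (a sketch only; in practice I pass these at submit time, and driver memory in particular normally has to be set before the JVM starts, e.g. via spark-submit):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.executor.memory", "10g")
    .config("spark.driver.memory", "10g")    # only effective if set before the driver JVM starts
    .config("spark.executor.instances", "5")  # num_executors
    .config("spark.executor.cores", "5")
    .getOrCreate()
)
```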
Also, I noticed that the executors were spread across 5 different nodes (hosts).
From the figure, I see that of the 9644 tasks, more than 95% did not run on the same node as their data. So I am wondering why there are so many RACK_LOCAL tasks. Specifically, why doesn't the scheduler place each task on the node closest to its data, in other words, why aren't there more NODE_LOCAL tasks?
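In case it is relevant: as far as I understand, the scheduler waits spark.locality.wait (default 3s) at each locality level before downgrading, e.g. from NODE_LOCAL to RACK_LOCAL. I have not set any of these explicitly, so I assume the defaults apply; this is how I would inspect them (using the `spark` session from above):

```python
# Locality-wait knobs; unset keys mean the built-in defaults are in effect
for key in ("spark.locality.wait",
            "spark.locality.wait.node",
            "spark.locality.wait.rack"):
    print(key, spark.conf.get(key, "not set (default applies)"))
```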
Thank you