I want to know the algorithm Spark uses to make task scheduling data-locality aware. Do we need a cluster manager like YARN to do so? If yes, what is the underlying algorithm used to schedule the tasks?


1 Answer


It depends. If your data is in the form of key-value pairs, then Spark handles data locality through partitioners: usually by hashing the key, though you can define a custom partitioner or use a RangePartitioner to optimize locality for your particular data. If your data has no key, Spark usually just holds on to it on a per-file basis (which can be problematic if you have a few large files, as you might not be working at optimal parallelism). If your data is either too distributed or too localized, you can use repartition(numPartitions) or coalesce(numPartitions) respectively to optimize how many partitions you work with.
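
As a rough illustration, here is a minimal Scala sketch of those three options (the sample data and partition counts are made up): hash-partitioning a key-value RDD, spreading it out with repartition(), and merging partitions back with coalesce().

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitioningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partitioning-sketch"))

        // Key-value data: records sharing a key land in the same partition,
        // because HashPartitioner assigns partitions by hashing the key.
        val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
        val hashed = pairs.partitionBy(new HashPartitioner(4))

        // Data too localized (too few partitions): repartition() performs a
        // full shuffle to spread the data across more partitions.
        val spreadOut = hashed.repartition(16)

        // Data too distributed (too many partitions): coalesce() merges
        // partitions, avoiding a full shuffle where it can.
        val merged = spreadOut.coalesce(4)

        println(merged.partitions.length) // 4
        sc.stop()
      }
    }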

Here is an example of how you can create a custom partitioner:

How to Define Custom partitioner for Spark RDDs of equally sized partition where each partition has equal number of elements?
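
Since that link may not be handy, here is a minimal sketch of what such a partitioner looks like (the class name and the first-letter keying rule are hypothetical): Spark only requires you to extend org.apache.spark.Partitioner and implement numPartitions and getPartition.

    import org.apache.spark.Partitioner

    // Hypothetical example: send keys that start with the same character to
    // the same partition, so related records stay together.
    class FirstLetterPartitioner(partitions: Int) extends Partitioner {
      require(partitions > 0, "Number of partitions must be positive")

      override def numPartitions: Int = partitions

      // Called once per record; must return a value in [0, numPartitions).
      override def getPartition(key: Any): Int = {
        val k = key.toString
        if (k.isEmpty) 0 else math.abs(k.charAt(0).toInt) % numPartitions
      }

      // equals/hashCode let Spark detect that an RDD is already partitioned
      // this way and skip an unnecessary shuffle.
      override def equals(other: Any): Boolean = other match {
        case p: FirstLetterPartitioner => p.numPartitions == numPartitions
        case _                         => false
      }

      override def hashCode: Int = numPartitions
    }

You would then apply it just like the built-in partitioners, e.g. pairs.partitionBy(new FirstLetterPartitioner(8)).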

  • Is there any specific algorithm Spark uses to optimize this? – openArrow Jan 25 '16 at 03:11
  • I realize now that you're asking more at the systems level, rather than about programmatically optimizing your load balancing. This page should have exactly what you're looking for: http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application – Daniel Imberman Jan 25 '16 at 07:10