How does HashPartitioner work in Spark?

Question

Say I have a lot of data in a couple of S3 files, about 5 GB each, which I read in using sc.textFile.

I need to join the data from the two files, so I opted to use a HashPartitioner with a partition count of 20. The job, submitted to 8 worker nodes, fails without any meaningful messages. Now I am thinking that maybe I need to pick a proper number of partitions.

Obviously, the idea is for Spark to partition all the data based on a chosen key. To load the data into 20 partitions, I imagine Spark has to read through every line of data, compute the hash of its key, and load it into the memory of the matching partition, which resides on one of the 8 worker nodes. If there is enough collective memory on the worker nodes, I assume this goes smoothly. At the end of the read, all the data is in the proper partition, in the right node's memory. Am I right so far?
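To make the mental model above concrete, here is a minimal sketch in plain Python (no Spark involved) of how a hash partitioner assigns records to partitions: the partition index is the non-negative remainder of the key's hash divided by the partition count. The function names `partition_for` and `hash_partition` and the sample records are hypothetical, chosen only for illustration; Spark's own HashPartitioner applies the same rule to the key's `hashCode`, as far as I understand it.

```python
from collections import defaultdict

def partition_for(key, num_partitions):
    """Return the partition index for a key (hypothetical helper).

    Python's % already yields a non-negative result for a positive
    modulus, so negative hashes need no extra handling here.
    """
    return hash(key) % num_partitions

def hash_partition(records, num_partitions):
    """Group (key, value) records by their partition index."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)

# Integer keys keep the example deterministic: a small int hashes to itself,
# while Python string hashes are randomized per process.
left = [(1, "a"), (21, "b"), (42, "c")]
right = [(1, "x"), (42, "y")]

left_parts = hash_partition(left, 20)
right_parts = hash_partition(right, 20)

# Keys 1 and 21 collide in partition 1; key 42 lands in partition 2.
print(left_parts)
print(right_parts)
```

Because both datasets use the same function and the same partition count, records with equal keys end up under the same partition index on both sides, which is what lets a join match them partition by partition.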

However, if the total memory cannot fit all the data, I imagine Spark will work on certain partitions first. After processing these first partitions, it flushes them and reads from the source files again, loading the remaining data into new partitions. This would mean reading the same file as many times as necessary to process all partitions using the available memory. Is this also correct?

Should I calculate the number of partitions so that at least one full partition fits into a single node's memory? Are there other guidelines to follow?

I think you are slightly off track here ;) Spark will automatically partition the data it reads in (using a HashPartitioner) and automatically distribute data among the nodes. If the data cannot fit into memory it spills it to disk. You do not need to worry about that unless it spills a lot and you need larger instance workers with more memory. If you do some transformation on your data (like map to key-value pairs or filtering), you may want to do repartitioning to avoid data skew. In which case you should have _at least_ numberOfComputeUnitsPerWorker*numberOfWorkers partitions. — Glennie Helles Sindholt, Sep 29 '15 at 07:50
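The rule of thumb in the comment above can be worked through as simple arithmetic. The sketch below is plain Python with hypothetical helper names; the figure of 4 compute units per worker is an assumption made for illustration, since the question does not state the core count. The size estimate uses the raw file sizes, and the in-memory footprint of a partition may well be larger.

```python
def min_partitions(compute_units_per_worker, num_workers):
    """Lower bound on partition count suggested in the comment."""
    return compute_units_per_worker * num_workers

def approx_partition_size_gb(total_gb, num_partitions):
    """Average raw data per partition, assuming keys are evenly spread."""
    return total_gb / num_partitions

# From the question: 8 workers, two files of about 5 GB each, 20 partitions.
# Assumed for illustration only: 4 compute units per worker.
suggested = min_partitions(compute_units_per_worker=4, num_workers=8)
per_partition = approx_partition_size_gb(total_gb=10, num_partitions=20)

print(suggested)      # 32
print(per_partition)  # 0.5
```

Under that assumption, 20 partitions would fall below the suggested minimum of 32, and each partition would hold roughly 0.5 GB of raw input.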


0 Answers