In Apache Spark, when using the HashPartitioner, I believe keys with the same hash value will go to the same partition (and hence the same node). What if most of the keys end up in the same partition? How do I balance the data across multiple worker nodes in such a scenario? Please help me out.
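For example, here is a minimal sketch of what I mean (the object name, master setting and sample data are just for illustration): almost every record shares one key, so one of the four partitions ends up with nearly all the data.

    // Demonstrates skew: almost every record has key 1, so HashPartitioner
    // sends nearly everything to a single partition.
    import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

    object SkewDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("skew-demo"))

        // 1000 records, ~990 of which share the key 1.
        val skewed = sc.parallelize((1 to 1000).map(i => (if (i % 100 == 0) i else 1, i)))
          .partitionBy(new HashPartitioner(4))

        // Count records per partition: one partition holds almost everything,
        // the others stay nearly empty.
        skewed
          .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
          .collect()
          .foreach(println)

        sc.stop()
      }
    }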
1 Answer
The doc says: "A Partitioner that implements hash-based partitioning using Java's Object.hashCode."
Yes, you are right. If the distribution of keys is not uniform, you can end up in a situation where part of your cluster sits idle.
It is your responsibility to ensure that keys are uniformly distributed across partitions (that is, their hash codes should not all collide).
For that you need a better understanding of HashPartitioner and what it does internally.
Note: for an Int key, the hash code is just the key itself. HashPartitioner then takes that hash code modulo the total number of partitions, i.e. hashCode mod numPartitions.
HashPartitioner uses the following utility method for that purpose:

    // Like x % mod, but always non-negative, because Java/Scala's %
    // can return a negative result for negative hash codes.
    def nonNegativeMod(x: Int, mod: Int): Int = {
      val rawMod = x % mod
      rawMod + (if (rawMod < 0) mod else 0)
    }
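To see the arithmetic in action, here is a small sketch (the object name is illustrative; only spark-core on the classpath is assumed) that asks a HashPartitioner with 8 partitions where a few Int keys go:

    import org.apache.spark.HashPartitioner

    object HashPartitionerDemo {
      def main(args: Array[String]): Unit = {
        val partitioner = new HashPartitioner(8)

        // For an Int key, hashCode == the key itself, so the partition is simply key mod 8.
        Seq(0, 8, 16, 3, 11).foreach { key =>
          println(s"key=$key -> partition=${partitioner.getPartition(key)}")
        }
        // Keys 0, 8 and 16 all land in partition 0; keys 3 and 11 both land in partition 3.
      }
    }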
For a better understanding, look at this example: spark-hashpartitioner-unexpected-partioning (answer by @user6910411).

– Ram Ghadiyaram
- So you mean to say we need to write our own custom partitioner to split the keys among all the partitions we have? – nikhil08 Jan 25 '17 at 18:00
- No need to write a custom partitioner. The hash codes of your keys just need to be uniformly distributed; multiple keys should not all share the same hash code. – Ram Ghadiyaram Jan 25 '17 at 18:18
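To check this in practice, here is a sketch (assuming a spark-shell, so a SparkContext named sc already exists; the sample data is made up) that prints the size of each partition. Roughly equal sizes mean the hash codes of your keys are spread well enough:

    import org.apache.spark.HashPartitioner

    // String keys "1" .. "100000": their hash codes differ, so the data spreads out.
    val pairs = sc.parallelize(1 to 100000).map(i => (i.toString, i))
    val partitioned = pairs.partitionBy(new HashPartitioner(8))

    // One number per partition; badly skewed keys would show one huge count here.
    partitioned.glom().map(_.length).collect().foreach(println)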