1

In apache spark -> By using the Hash partition-er , I believe the keys with same hash value will go on to same node? what if most of the keys goes on to same partition-er and how to balance the data on multiple worker node in such scenarios . please help me out

nikhil08
  • 23
  • 3

1 Answers1

0

Doc says .. A Partitioner that implements hash-based partitioning using Java's Object.hashCode

Yes. you are right. So if distribution of keys are not uniform you can end up in situations when part of your cluster is idle. See

Its your responsibility to ensure that keys are uniformly distributed across.(that means hashcode should not be same)

For that you need better understanding of HashPartitioner and what it does internally.

Note : Hash code of the key will just be the key itself. The HashPartitioner will mod it with the number of total partitions. i.e hashcode Mod with totnumpartions.

Below Util class method is used for that purpose by HashPartitioner

def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}
Community
  • 1
  • 1
Ram Ghadiyaram
  • 28,239
  • 13
  • 95
  • 121
  • So you mean to say we need to write our own custom partition-er to split the keys among all the partitions we have ? – nikhil08 Jan 25 '17 at 18:00
  • No need to write custom partioner. Hashcode of keys should be uniform. For multiple keys hash code should not be same. – Ram Ghadiyaram Jan 25 '17 at 18:18