In Apache Spark, when using the HashPartitioner, I believe keys with the same hash value will go to the same partition (and hence the same node). What if most of the keys end up in the same partition? How do I balance the data across multiple worker nodes in such a scenario? Please help me out.
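For example, here is a minimal sketch of what I mean (the object name, master setting and sample data are just for illustration): almost every record shares one key, so one of the four partitions ends up with nearly all the data.

    // Demonstrates skew: almost every record has key 1, so HashPartitioner
    // sends nearly everything to a single partition.
    import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

    object SkewDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("skew-demo"))

        // 1000 records, ~990 of which share the key 1.
        val skewed = sc.parallelize((1 to 1000).map(i => (if (i % 100 == 0) i else 1, i)))
          .partitionBy(new HashPartitioner(4))

        // Count records per partition: one partition holds almost everything,
        // the others stay nearly empty.
        skewed
          .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
          .collect()
          .foreach(println)

        sc.stop()
      }
    }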
1 Answer
The doc says: "A Partitioner that implements hash-based partitioning using Java's Object.hashCode."
Yes, you are right. If the distribution of keys is not uniform, you can end up in a situation where part of your cluster sits idle.
It is your responsibility to ensure that keys are uniformly distributed across partitions (that is, their hash codes should not all collide).
For that you need a better understanding of HashPartitioner and what it does internally.
Note: for an Int key, the hash code is just the key itself. HashPartitioner then takes that hash code modulo the total number of partitions, i.e. hashCode mod numPartitions.
HashPartitioner uses the following utility method for that purpose:

    // Like x % mod, but always non-negative, because Java/Scala's %
    // can return a negative result for negative hash codes.
    def nonNegativeMod(x: Int, mod: Int): Int = {
      val rawMod = x % mod
      rawMod + (if (rawMod < 0) mod else 0)
    }
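To see the arithmetic in action, here is a small sketch (the object name is illustrative; only spark-core on the classpath is assumed) that asks a HashPartitioner with 8 partitions where a few Int keys go:

    import org.apache.spark.HashPartitioner

    object HashPartitionerDemo {
      def main(args: Array[String]): Unit = {
        val partitioner = new HashPartitioner(8)

        // For an Int key, hashCode == the key itself, so the partition is simply key mod 8.
        Seq(0, 8, 16, 3, 11).foreach { key =>
          println(s"key=$key -> partition=${partitioner.getPartition(key)}")
        }
        // Keys 0, 8 and 16 all land in partition 0; keys 3 and 11 both land in partition 3.
      }
    }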
For a better understanding, look at this example: spark-hashpartitioner-unexpected-partioning (answer by @user6910411).

– Ram Ghadiyaram
- So you mean to say we need to write our own custom partitioner to split the keys among all the partitions we have? – nikhil08 Jan 25 '17 at 18:00
- No need to write a custom partitioner. The hash codes of your keys just need to be uniformly distributed; multiple keys should not all share the same hash code. – Ram Ghadiyaram Jan 25 '17 at 18:18
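To check this in practice, here is a sketch (assuming a spark-shell, so a SparkContext named sc already exists; the sample data is made up) that prints the size of each partition. Roughly equal sizes mean the hash codes of your keys are spread well enough:

    import org.apache.spark.HashPartitioner

    // String keys "1" .. "100000": their hash codes differ, so the data spreads out.
    val pairs = sc.parallelize(1 to 100000).map(i => (i.toString, i))
    val partitioned = pairs.partitionBy(new HashPartitioner(8))

    // One number per partition; badly skewed keys would show one huge count here.
    partitioned.glom().map(_.length).collect().foreach(println)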