I want each partition to contain only one key. Code in spark-shell:

val rdd = sc.parallelize(List(("a",1), ("a",2),("b",1), ("b",3),("c",1),("e",5))).cache()
import org.apache.spark.HashPartitioner
rdd.partitionBy(new HashPartitioner(4)).glom().collect()

And, the result is:

Array[Array[(String, Int)]] = Array(Array(), Array((a,1), (a,2), (e,5)), Array((b,1), (b,3)), Array((c,1)))

There are 4 keys ("a", "b", "c", "e"), but they end up in just 3 partitions even though I created the HashPartitioner with 4 partitions. I think it's a hash collision because I use HashPartitioner. So how can I put different keys into different partitions? I have read this answer, but it still does not solve my problem.

Community
1 Answer

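You are right, it is a hash collision: HashPartitioner places a key in partition key.hashCode % numPartitions (kept non-negative), and some of your keys produce the same value. Here "a".hashCode is 97 and "e".hashCode is 101, and both give 1 modulo 4, so they land in partition 1 together while partition 0 stays empty.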
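The collision can be reproduced without a cluster. This is a minimal sketch that assumes HashPartitioner reduces key.hashCode modulo numPartitions to a non-negative value; the helper nonNegativeMod is written out here for illustration and is not a public Spark call.

```scala
// Illustrative helper (an assumption about HashPartitioner's arithmetic):
// take x modulo mod and shift negative results into the range [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

val numPartitions = 4
val keys = List("a", "b", "c", "e")

// Map each key to the partition index it would be assigned.
val assignment: Map[String, Int] =
  keys.map(k => k -> nonNegativeMod(k.hashCode, numPartitions)).toMap

// "a" and "e" share index 1, "b" gets 2, "c" gets 3, and index 0 is unused,
// which matches the glom() output in the question.
assignment.toSeq.sortBy(_._1).foreach { case (k, p) => println(s"$k -> $p") }
```

Pasting this into spark-shell shows why asking for 4 partitions does not guarantee 4 distinct targets: the partition index depends only on the hash codes of the keys.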

The only solution here is to create your own partitioner that puts each key into a separate partition. Just make sure it has at least as many partitions as there are distinct keys.
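One way such a partitioner might look is sketched below. The class name ExactKeyPartitioner and its constructor are hypothetical, made up for this example; only org.apache.spark.Partitioner with its numPartitions and getPartition members comes from Spark. It assumes the set of distinct keys is small enough to collect to the driver.

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner: every known key gets its own partition index.
class ExactKeyPartitioner(keys: Seq[Any]) extends Partitioner {
  // Fixed lookup table from key to partition index, one index per distinct key.
  private val index: Map[Any, Int] = keys.distinct.zipWithIndex.toMap

  override def numPartitions: Int = index.size

  override def getPartition(key: Any): Int =
    index.getOrElse(key, throw new IllegalArgumentException(s"Unknown key: $key"))

  // Spark compares partitioners to decide whether a shuffle can be skipped.
  override def equals(other: Any): Boolean = other match {
    case p: ExactKeyPartitioner => p.index == index
    case _                      => false
  }

  override def hashCode: Int = index.hashCode
}

// Local check without a SparkContext, using the keys from the question.
val partitioner = new ExactKeyPartitioner(Seq("a", "b", "c", "e"))
val targets = Seq("a", "b", "c", "e").map(k => partitioner.getPartition(k))

// In spark-shell, with the rdd from the question, it could be applied like this:
//   val keys = rdd.keys.distinct().collect().toSeq
//   rdd.partitionBy(new ExactKeyPartitioner(keys)).glom().collect()
```

With this approach each inner array returned by glom() should hold a single key. The trade-off is an extra pass over the data to collect the distinct keys, and a key that was not in the table fails fast instead of being silently merged into another partition.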

More about partitioning is here and here and here and here.

Vladislav Varslavans