I want partiton has only one key. Code in spark-shell
val rdd = sc.parallelize(List(("a",1), ("a",2),("b",1), ("b",3),("c",1),("e",5))).cache()
import org.apache.spark.HashPartitioner
rdd.partitionBy(new HashPartitioner(4)).glom().collect()
And, the result is:
Array[Array[(String, Int)]] = Array(Array(), Array((a,1), (a,2), (e,5)), Array((b,1), (b,3)), Array((c,1)))
There are 4 keys("a"
,"b"
,"c"
,"e"
), but they are in just 3 partitons though I define 4 HashPartitioner.I think it's hash collision because I use HashPartitioner
. So how can I implement different keys into different partitions.
I have read this answer, still cannot solve my question .