I am trying to understand how Spark partitioning works.
To explore this, I ran the following piece of code on Spark 1.6:
def countByPartition1(rdd: RDD[(String, Int)]) = {
  rdd.mapPartitions(iter => Iterator(iter.length))
}

def countByPartition2(rdd: RDD[String]) = {
  rdd.mapPartitions(iter => Iterator(iter.length))
}
// RDD creation
val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa", 1)), 8)
countByPartition1(rdd1).collect()
// Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
countByPartition2(rdd2).collect()
// Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
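For completeness, I also checked whether `parallelize` assigns a partitioner at all (run in the same Spark 1.6 shell session):

```scala
// Neither RDD reports a partitioner after parallelize:
rdd1.partitioner  // Option[org.apache.spark.Partitioner] = None
rdd2.partitioner  // Option[org.apache.spark.Partitioner] = None
```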
In both cases the data is distributed uniformly across partitions. Based on this observation, I have the following questions:
- In the case of rdd1, hash partitioning should compute the hash code of the key ("aa" in every record), so shouldn't all records end up in a single partition instead of being distributed uniformly?
- In the case of rdd2, there are no key-value pairs, so how would hash partitioning work at all, i.e. what key would be used to compute the hash code?
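To make my first question concrete, my understanding of what a `HashPartitioner` would do is roughly the following (a sketch of the non-negative-modulo placement logic, not the actual Spark source):

```scala
// Sketch: how I understand HashPartitioner places a key.
// Takes the key's hashCode modulo numPartitions, adjusted to be non-negative.
def getPartition(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod
}

// Since every record in rdd1 has the same key, this would map all of
// them to one and the same partition index:
getPartition("aa", 8)
```

If this sketch is right, I would expect all four `("aa", 1)` records in one partition, which contradicts the observed `Array(0, 1, 0, 1, 0, 1, 0, 1)`.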
I have read @zero323's answer to "How does HashPartitioner work?", but it does not answer these questions.