
I am trying to understand how Spark partitioning works.

To understand this, I have the following piece of code on Spark 1.6:

import org.apache.spark.rdd.RDD

// Returns the number of records held in each partition of a pair RDD
def countByPartition1(rdd: RDD[(String, Int)]) = {
    rdd.mapPartitions(iter => Iterator(iter.length))
}

// Same, but for an RDD of plain strings
def countByPartition2(rdd: RDD[String]) = {
    rdd.mapPartitions(iter => Iterator(iter.length))
}

// RDD creation

val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa", 1)), 8)
countByPartition1(rdd1).collect()

Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)

val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
countByPartition2(rdd2).collect()

Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)

In both cases the data is distributed uniformly. I have the following questions based on the above observation:

  1. In the case of rdd1, hash partitioning should calculate the hashcode of the key (i.e. "aa" in this case), so shouldn't all records go to a single partition instead of being uniformly distributed?
  2. In the case of rdd2, there is no key-value pair, so how would hash partitioning work here, i.e. what is the key used to calculate the hashcode?

I have gone through @zero323's answer to How does HashPartitioner work? but it does not answer these questions.
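A minimal sketch to probe this, assuming the same Spark 1.6 shell as above (the HashPartitioner import and the explicit partitionBy are additions for comparison, not part of the original code; the commented result is what I would expect, not a verified output):

import org.apache.spark.HashPartitioner

// RDDs built with sc.parallelize carry no partitioner; the input
// collection is simply split into evenly sized slices.
rdd1.partitioner    // Option[org.apache.spark.Partitioner] = None

// A HashPartitioner only comes into play after an explicit partitionBy
// (or a shuffle operation such as reduceByKey or join):
val hashed = rdd1.partitionBy(new HashPartitioner(8))
countByPartition1(hashed).collect()
// Expected: all four records land in the single partition given by
// "aa".hashCode % 8, so one partition holds count 4 and the rest 0.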

Vikash Pareek
  • Thank you @zero323, I understood this with the help of the answer at https://stackoverflow.com/questions/34491219/default-partitioning-scheme-in-spark. In my use case I am performing a join between 2 DataFrames (with spark.sql.shuffle.partitions set to 16); the resultant DataFrame contains 16 partitions, but somehow all the records went to a single partition and the remaining 15 partitions are empty. What can be the possible causes of this skewness? Am I missing something for uniform distribution? – Vikash Pareek Jun 24 '17 at 16:13
  • Hard to say without seeing the code. Either skewed data distribution, or using some inefficient parts of API, like window functions. – zero323 Jun 25 '17 at 00:49
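Regarding the join skew mentioned in the comment above, a hedged sketch of how the per-partition counts of the joined DataFrame could be checked (df1, df2 and the column name "key" are placeholders, not taken from the original post):

import org.apache.spark.sql.DataFrame

// Count the rows in each partition of a DataFrame (same idea as above,
// applied to the DataFrame's underlying RDD[Row])
def countByPartitionDF(df: DataFrame): Array[Int] =
  df.rdd.mapPartitions(iter => Iterator(iter.length)).collect()

// sqlContext.setConf("spark.sql.shuffle.partitions", "16")
// val joined = df1.join(df2, Seq("key"))   // df1, df2 and "key" are hypothetical
// countByPartitionDF(joined)
// If "key" has only one (or very few) distinct values, the shuffle's
// HashPartitioner maps every matching row to the same partition,
// which would explain 1 full partition and 15 empty ones.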

0 Answers