How to efficiently distribute and use partitions in Spark?

Question

This is my example.

val arr = Array((1,2), (1,3), (1,4), (2,3), (4,5))
val data = sc.parallelize(arr, 5)

data.glom.map(_.length).collect
Array[Int] = Array(1, 1, 1, 1, 1)

val agg = data.reduceByKey(_+_)
agg.glom.map(_.length).collect
Array[Int] = Array(0, 1, 1, 0, 1)

val fil = agg.filter(_._2 < 4)
fil.glom.map(_.length).collect
Array[Int] = Array(0, 0, 1, 0, 0)

val sub = data.map{case(x,y) => (x, (x,y))}.subtractByKey(fil).map(_._2)
sub.collect
Array[(Int, Int)] = Array((1,4), (1,3), (1,2), (4,5))

sub.glom.map(_.length).collect
Array[Int] = Array(0, 3, 0, 0, 1)

What I want to know is how to distribute the data evenly across partitions.

The data variable consists of five partitions, with all the data evenly partitioned.

ex)par1: (1,2)
   par2: (1,3)
   par3: (1,4)
   par4: (2,3)
   par5: (4,5)

After several transformations, only two of the five partitions allocated to the sub variable are used.

The sub variable still consists of five partitions, but the data is no longer evenly distributed across them.

ex)par1: empty
   par2: (1,2),(1,3),(1,4)
   par3: empty
   par4: empty
   par5: (4,5)

If I add another transformation to the sub variable, five partitions are available, but only two of them will actually be used for the operation.

ex)sub.map{case(x,y) => (x, x, (x,y))}

So I want to make use of all available partitions when the data is processed.

I tried the repartition method, but it is not cheap.

ex) sub.repartition(5).glom.map(_.length).collect
Array[Int] = Array(0, 1, 1, 2, 0)

So I'm looking for a smart way to utilize as many partitions as possible.

Is there a good way?

score 3 · Answer 1 · answered Mar 27 '17 at 11:09

So repartition is definitely the way to go :)

Your example is a little too simple to demonstrate anything, as Spark is built to handle billions of rows, not 5 rows. repartition will not put exactly the same number of rows into each partition, but it will distribute the data roughly evenly. Try redoing your example with 1,000,000 rows instead and you will see that the data is indeed distributed evenly after a repartition.
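A rough way to try this with one million rows is sketched below. It assumes a spark-shell session where `sc` is already defined; the names `skewed` and `balanced` and the choice of keys are made up for illustration and are not from the question.

```scala
import org.apache.spark.HashPartitioner

// One million pairs with only two distinct keys (1 and 4), mimicking `sub`.
// Partitioning by key hash over 5 partitions puts every row into
// partition 1 or partition 4 and leaves the other three empty.
val skewed = sc.parallelize(1 to 1000000)
  .map(i => (if (i % 4 == 0) 4 else 1, i))
  .partitionBy(new HashPartitioner(5))

// Rows per partition before repartitioning.
skewed.glom.map(_.length).collect

// repartition shuffles the rows without looking at the key,
// so the partition sizes should come out close to each other.
val balanced = skewed.repartition(5)
balanced.glom.map(_.length).collect
```

Comparing the two `collect` results should show the skew before the repartition and a near-even spread after it.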

Data skew is often a big problem when working with transformations of large amounts of data, and repartitioning your data does come with the cost of additional time as it needs to shuffle data around. Sometimes it is worth taking the penalty though, because it will make the following transformation stages run faster.
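The skew in the question's `sub` can be reproduced without Spark, assuming the shuffles in `reduceByKey` and `subtractByKey` use Spark's default hash partitioning, which places a key in partition `key.hashCode mod numPartitions`. The sketch below is plain Scala; `HashPartitionSketch`, `partitionFor` and `partitionSizes` are illustrative names, not Spark APIs.

```scala
object HashPartitionSketch {
  // Non-negative modulo of the key's hash code, which is the idea
  // behind hash partitioning: equal keys always share a partition.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Count how many records land in each partition index.
  def partitionSizes[K, V](records: Seq[(K, V)], numPartitions: Int): List[Int] = {
    val counts = Array.fill(numPartitions)(0)
    records.foreach { case (k, _) => counts(partitionFor(k, numPartitions)) += 1 }
    counts.toList
  }

  def main(args: Array[String]): Unit = {
    // The keyed records that survive subtractByKey in the question.
    val sub = Seq((1, (1, 2)), (1, (1, 3)), (1, (1, 4)), (4, (4, 5)))
    println(partitionSizes(sub, 5)) // List(0, 3, 0, 0, 1)
  }
}
```

With only the keys 1 and 4 left, at most two of the five partitions can receive data, no matter how many rows there are. repartition avoids this because it redistributes rows without regard to their keys.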
