This is my example.
val arr = Array((1,2), (1,3), (1,4), (2,3), (4,5))
val data = sc.parallelize(arr, 5)
data.glom.map(_length).collect
Array[Int] = Array(1, 1, 1, 1, 1)
val agg = data.reduceByKey(_+_)
agg.glom.map(_.length).collect
Array[Int] = Array(0, 1, 1, 0, 1)
val fil = agg.filter(_._2 < 4)
fil.glom.map(_.length).collect
Array[Int] = Array(0, 0, 1, 0, 0)
val sub = data.map{case(x,y) => (x, (x,y))}.subtractByKey(fil).map(_._2)
Array[(Int, Int)] = Array((1,4), (1,3), (1,2), (4,5))
sub.glom.map(_.length).collect
Array[Int] = Array(0, 3, 0, 0, 1)
What I'm wondering is to evenly distribute partitions.
The data
variable consists of five partitions, with all the data evenly partitioned.
ex)par1: (1,2)
par2: (1,3)
par3: (1,4)
par4: (2,3)
par5: (4,5)
After several transformation operation
, Only two of the five partitions allocated to the sub
variable are used.
The sub
variable consists of five partitions, but not all data is evenly partitioned.
ex)par1: empty
par2: (1,2),(1,3),(1,4)
par3: empty
par4: empty
par5: (4,5)
If I add another transformation operation
to the sub
variable, there will be 5 available partitions, but only 2 partitions will be used for the operation.
ex)sub.map{case(x,y) => (x, x, (x,y))}
So I wanna make use of all available partitions when data is operated on.
I used the repartition
method, but it is not cheaper.
ex) sub.repartition(5).glom.map(_.length).collect
Array[Int] = Array(0, 1, 1, 2, 0)
So I'm looking for a wise way to utilize as many partitions as possible.
Is there a good way?