11

I am fetching data from HDFS and storing it in a Spark RDD. Spark creates the number of partitions based on the number of HDFS blocks. This leads to a large number of empty partitions which also get processed during piping. To remove this overhead, I want to filter out all the empty partitions from the RDD. I am aware of coalesce and repartition, but there is no guarantee that all the empty partitions will be removed.

Is there any other way to go about this?

user3898179
  • 211
  • 5
  • 13
  • "*This leads to a large number of empty partitions which also get processed during piping*" I do not understand this sentence. Why and when are these empty partitions created? – Mikel Urkia Oct 22 '15 at 09:30
  • Suppose I am fetching data using Hive and my HDFS has 500 file blocks for the given Hive table; in that case 500 partitions will be created in the RDD. Later, while doing a `groupByKey`, empty partitions are left behind. – user3898179 Oct 22 '15 at 09:35
  • 1
    If you have some a priori knowledge about your data you can repartition using either `RangePartitioner` or `HashPartitioner`. If not, you can partition based on random numbers. – zero323 Oct 22 '15 at 09:39
  • I'd say empty partitions are automatically *deleted* and not processed by Spark, although I am not 100% sure. – Mikel Urkia Oct 22 '15 at 09:43
  • 2
    @MikelUrkia empty partitions are not deleted (you can see them in the Spark UI). I have, however, never experienced empty partitions after doing a `repartition`... – Glennie Helles Sindholt Oct 22 '15 at 10:45
  • But `repartition` is really costly due to the data reshuffling. How do we know which partitions are empty if the job is being submitted on a cluster? – user3898179 Oct 22 '15 at 10:57

1 Answer

9

There isn't an easy way to simply delete the empty partitions from an RDD.
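Since there is no direct API for dropping empty partitions, a first step is usually to find out *which* partitions are empty. In Spark you could get this with `rdd.glom().map(len)` or `mapPartitionsWithIndex`; the sketch below doesn't assume a Spark installation and instead models partitions as plain Python lists, just to show the bookkeeping:

```python
# Model an RDD's partitions as plain Python lists -- a stand-in for what
# rdd.glom().collect() would return in Spark (no Spark assumed here).
partitions = [[], [1, 2], [], [], [3], []]

# Index of every non-empty partition. In Spark, the equivalent would be
# rdd.mapPartitionsWithIndex(...) or rdd.glom().map(len) followed by a filter.
non_empty = [i for i, part in enumerate(partitions) if part]

print(non_empty)       # [1, 4] -- which partitions actually hold data
print(len(non_empty))  # 2 -- how many partitions would survive
```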

`coalesce` doesn't guarantee that the empty partitions will be deleted. If you have an RDD with 40 blank partitions and 10 partitions with data, there will still be empty partitions after `rdd.coalesce(45)`.

The `repartition` method splits the data evenly over all the partitions, so there won't be any empty partitions as long as there are at least as many rows as partitions. If you have an RDD with 50 blank partitions and 10 partitions with data and run `rdd.repartition(20)`, the data will be evenly split across the 20 partitions.
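The even-spread behaviour can be sketched without Spark. The simplified model below pools all rows and deals them out round-robin (Spark's real `repartition` shuffles with a random starting partition per input partition, but the resulting even spread is the same idea), using the 50-blank-plus-10-data scenario from the answer:

```python
from itertools import chain

# Simplified model of repartition: pool all rows, deal them out round-robin.
# (Spark's actual repartition uses a full shuffle; this only illustrates
# why the result has no empty partitions when rows >= partitions.)
def repartition(partitions, n):
    rows = list(chain.from_iterable(partitions))
    return [rows[i::n] for i in range(n)]

# 50 blank partitions plus 10 partitions of 100 rows each, as in the answer.
rdd = [[] for _ in range(50)] + [[x] * 100 for x in range(10)]
result = repartition(rdd, 20)

# 1000 rows spread over 20 partitions: 50 rows each, none empty.
assert all(len(p) == 50 for p in result)
```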

Powers
  • 18,150
  • 10
  • 103
  • 108
  • `repartition` is not a solution here, right? If I have a single partition with 1 record of data and I call `repartition(10)`, that would result in 1 partition having a record and 9 empty partitions. That would not address the problem. – JavaPlanet Mar 21 '22 at 17:42
  • @JavaPlanet - Yea, if you have 1 row of data and repartition it into 10 partitions, then you'll still have empty partitions. If you have 1 million rows of data and repartition it into 10 partitions, then you won't have any empty partitions. So the number of partitions needs to be selected intelligently to ensure no empty partitions are created. – Powers Mar 22 '22 at 11:05
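The "selected intelligently" advice in the last comment can be captured in a small helper. `safe_num_partitions` is a hypothetical name, not a Spark API; the idea is simply to cap the requested partition count at the number of rows so `repartition(n)` cannot leave empty partitions:

```python
# Hypothetical helper (not part of Spark): cap the requested partition
# count at the row count so repartition(n) cannot produce empty partitions.
def safe_num_partitions(requested, num_rows):
    return max(1, min(requested, num_rows))

print(safe_num_partitions(10, 1))          # 1  -- one row, so one partition
print(safe_num_partitions(10, 1_000_000))  # 10 -- plenty of rows, keep 10
```

In Spark you would pass the result to `rdd.repartition(...)`, using something like `rdd.count()` (itself a full pass over the data) or a known size estimate for `num_rows`.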