
How can I force a (mostly) uniform distribution?

I want to perform something like:

    df.repartition(5000) // scatter
      .transform(some_complex_function)
      .repartition(200) // gather
      .write.parquet("myresult")

Indeed, 5000 tasks are executed after the repartition step. However, the size of the input per task ranges from less than 1 MB to 16 MB.

The data is still skewed. How can I make sure it is no longer skewed and that cluster resources are used efficiently?

Edit

I learnt that this is due to the use of complex-typed columns, i.e. arrays. Also note that `some_complex_function` operates on this column, i.e. its complexity increases with the number of elements inside the array.
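To make the skew concrete, this is roughly how the distribution of array lengths can be inspected; `payload` is just a placeholder for the actual array column:

    import org.apache.spark.sql.functions.{col, size}

    // Rough check of the workload skew: distribution of array lengths.
    // "payload" stands in for the real array column.
    df.select(size(col("payload")).alias("n_elements"))
      .summary("min", "25%", "50%", "75%", "max")
      .show()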

Is there a way to partition better for such a case?

Georg Heiler

1 Answer


`repartition` should distribute the number of records uniformly; you can verify that using the techniques listed here: Apache Spark: Get number of records per partition
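As a minimal sketch of such a check, `spark_partition_id()` can be used to count the records that end up in each partition after the scatter step:

    import org.apache.spark.sql.functions.spark_partition_id

    // Sanity check: number of records per partition after repartitioning.
    df.repartition(5000)
      .groupBy(spark_partition_id().alias("partition_id"))
      .count()
      .orderBy("count")
      .show(20)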

If your records contain some complex data structures, or strings of various lengths, then the number of bytes per partition will not be equal. I asked for a solution to this problem here: How to (equally) partition array-data in spark dataframe
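To see how uneven the actual work is, the same trick can be extended to sum the array lengths per partition (again, `payload` stands in for the real array column):

    import org.apache.spark.sql.functions.{col, size, spark_partition_id, sum}

    // Record counts per partition may be uniform while the total number of
    // array elements (a proxy for bytes and for the cost of
    // some_complex_function) is not.
    df.repartition(5000)
      .groupBy(spark_partition_id().alias("partition_id"))
      .agg(sum(size(col("payload"))).alias("total_elements"))
      .orderBy("total_elements")
      .show(20)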

Raphael Roth
  • I am almost certain I am experiencing the same problem. However, in my case the input data is fairly small (assume 10G) and the transformation rather expensive. As in your case, the cost is not proportional to the number of records in the DF but rather to the number of observations inside the array, i.e. in my case repartition is considered rather cheap (that's why I already went from the default 200 to 5000). Still, the non-uniformly loaded partitions make the job take forever. – Georg Heiler Sep 16 '18 at 11:53
  • What do you think about adding an additional column as `size` of the array data structure and then repartition according to the size? – Georg Heiler Sep 16 '18 at 11:54
  • However, that would bring all equally sized arrays into the same partition, which is not what I want. Especially not for the large arrays. – Georg Heiler Sep 16 '18 at 12:02
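Following up on the size-column idea from the comments: one way to use the array length without putting all equally sized arrays into the same partition is to bucket rows by the running total of array lengths, so that each bucket carries roughly the same amount of work, and then repartition on the bucket id. This is only a rough sketch, not the approach from the linked question; it assumes the array column is called `payload` and that the data is small enough for a global window (which pulls all rows through a single task):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, monotonically_increasing_id, size, sum}

    val numBuckets = 5000

    // Attach the array length as an explicit column plus a stable row id.
    val withSize = df
      .withColumn("arr_size", size(col("payload")))
      .withColumn("row_id", monotonically_increasing_id())

    // Target amount of work (array elements) per bucket.
    val totalWork = withSize.agg(sum("arr_size")).first().getLong(0)
    val workPerBucket = math.max(1L, totalWork / numBuckets)

    // Running total of work. The window has no partitionBy, so all rows go
    // through one task -- acceptable only for modest data volumes.
    val w = Window.orderBy("row_id")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    val bucketed = withSize
      .withColumn("bucket", (sum(col("arr_size")).over(w) / workPerBucket).cast("long"))
      .repartition(numBuckets, col("bucket"))

    bucketed
      .transform(some_complex_function)
      .repartition(200)
      .write.parquet("myresult")

Hashing on the bucket id still lets a few bucket ids collide in the same partition, but the total work per partition should end up far more even than with a plain record-count-based repartition.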