How can I force a (mostly) uniform distribution?
I want to perform something like:
df.repartition(5000)   // scatter
  .transform(some_complex_function)
  .repartition(200)    // gather
  .write.parquet("myresult")
Indeed, 5000 tasks are executed after the repartition step. However, the input size per task varies between less than 1 MB and 16 MB.
The data is still skewed. How can I make sure it is no longer skewed and that cluster resources are used efficiently?
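For reference, this is roughly how I am measuring the imbalance (a spark-shell style sketch using the same df as above; spark_partition_id just tags each row with the partition it landed in after the scatter step):

import org.apache.spark.sql.functions.{col, spark_partition_id}

// Count rows per partition right after the scatter step; the spread of the
// counts shows how uneven the partitions are.
df.repartition(5000)
  .groupBy(spark_partition_id().alias("partition"))
  .count()
  .orderBy(col("count").desc)
  .show(20, false)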
Edit:
I learned that this is due to the use of complex-type columns, i.e. arrays. Also note that some_complex_function
operates on this column, i.e. its complexity increases with the number of elements in the array.
Is there a way to partition better for such a case?
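To make the question concrete, this is the kind of thing I have in mind, but I am not sure it is the right approach (a sketch only; "items" is a placeholder for the actual array column, and the window with no partitionBy would funnel all rows through one task at scale):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, size}

val numParts = 5000

// Weight each row by its array length, since the cost of some_complex_function
// grows with the number of elements.
val weighted = df.withColumn("weight", size(col("items")))

// Number the rows in descending weight order and deal them out round-robin,
// so every bucket receives a similar mix of heavy and light rows.
val byWeight = Window.orderBy(col("weight").desc)
val balanced = weighted
  .withColumn("bucket", row_number().over(byWeight) % numParts)
  .repartitionByRange(numParts, col("bucket"))
  .drop("weight", "bucket")

Is something along these lines reasonable, or is there a better built-in way to balance partitions by per-row work rather than by row count?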