I used to write with df.repartition(1200).write.parquet(...), which created 1200 files, as specified in the repartition argument. I am now using partitionBy, i.e. df.repartition(1200).write.partitionBy("mykey").parquet(...). This works fine, except that it now creates 1200 files per bucket of mykey. I would like to have roughly 1200 files overall.
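For concreteness, the two write paths look roughly like this (the output paths are placeholders):

```python
# Plain repartition: produces ~1200 part files in a single directory.
df.repartition(1200).write.parquet("/path/to/flat_output")

# With partitionBy: each of the 1200 shuffle partitions writes one file per
# mykey value it contains, so every mykey=... directory can end up with
# up to 1200 files.
(df.repartition(1200)
   .write
   .partitionBy("mykey")
   .parquet("/path/to/partitioned_output"))
```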
Other posts suggest repartitioning across certain keys. The relevant documentation for my Spark version (2.4.0) seems to suggest that this feature was added later. Is there any other way to achieve this? I guess I could repartition to 1200 / len(unique("mykey")) instead, but that feels a bit hacky. Is there a better way to do it? I am also worried that reducing the number of partitions could result in out-of-memory errors.
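For reference, a minimal sketch of the hacky workaround I have in mind (the output path is a placeholder, and it assumes the distinct-key count divides the target file count reasonably evenly):

```python
# Aim for ~1200 files in total: with N distinct keys, each of the
# 1200 // N round-robin partitions writes one file per key it contains,
# giving roughly (1200 // N) * N ≈ 1200 files overall.
n_keys = df.select("mykey").distinct().count()
files_per_key = max(1, 1200 // n_keys)

(df.repartition(files_per_key)
   .write
   .partitionBy("mykey")
   .parquet("/path/to/partitioned_output"))
```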