When reading from a Hive table, performing a projection, and writing the result back to HDFS, there is obviously less data than in the raw table.
How can I ensure that the number of files per partition (date) does not become very large, i.e. that each partition does not contain a large number of small files?
df.coalesce(200).write.partitionBy('date').parquet('foo')
still outputs many small files. Obviously, I would like to not reduce the parallelism in Spark, but rather merge the files later on.
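For example, one option I have considered (a sketch, assuming the partition column is named 'date' and the same output path 'foo' as above) is to repartition by the partition column before writing, so that each date ends up in a single file. But that routes all rows of a given date through one task, which is exactly the loss of parallelism I want to avoid:

    # Sketch: all rows for a given 'date' hash into the same shuffle
    # partition, so each date directory ends up with a single file,
    # at the cost of writing each date with a single task.
    df.repartition('date') \
      .write \
      .partitionBy('date') \
      .parquet('foo')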