I have a data source that consists of a huge number of small files. I would like to save it to another storage location, partitioned by the column user_id:
sdf = spark.read.json("...")
sdf.write.partitionBy("user_id").json("...")
The reason for this is that another system needs to be able to delete only selected users' data upon request.
This works, but I still get many files within each partition (due to my input data). For performance reasons I would like to reduce the number of files within each partition, ideally to just one (the process will run each day, so having one output file per user per day would work well).
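I was considering adding a repartition on user_id before the write, on the assumption that it would collapse each user's rows into a single task and therefore a single output file per user_id directory. This is just a sketch; I haven't verified that it actually behaves this way:

sdf = spark.read.json("...")
# repartition by user_id first, so each user's rows land in one task
# and (I think) each user_id= directory gets a single output file
sdf.repartition("user_id").write.partitionBy("user_id").json("...")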
Is that the right approach, or how else can I achieve this with PySpark?