I have a data source that consists of a huge number of small files. I would like to save it, partitioned by the column user_id, to another storage location:

sdf = spark.read.json("...")
sdf.write.partitionBy("user_id").json("...")

The reason for this is that I want another system to be able to delete only select users' data upon request.

This works, but I still get many files within each partition (due to my input data). For performance reasons I would like to reduce the number of files within each partition, ideally to just one (the process will run each day, so having one output file per user per day would work well).

How do I achieve this with PySpark?

casparjespersen
  • maybe this https://stackoverflow.com/questions/60048027/how-to-manage-physical-data-placement-of-a-dataframe-across-the-cluster-with-pys/60048672#60048672 could help – murtihash Jun 08 '20 at 18:29

2 Answers


Just add coalesce with the number of partitions (and hence output files) you want:

sdf.coalesce(1).write.partitionBy("user_id").json("...")
QuickSilver
  • Thank you. I found that repartition (as proposed by Shubham Jain) was the fastest, so I accepted his/her answer. – casparjespersen Jun 08 '20 at 19:53
  • coalesce will always be more efficient than repartition when reducing the number of partitions: https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce @casparjespersen – QuickSilver Jun 08 '20 at 20:02
  • This was not my experience. I'll be happy to share more details about setup/data if you're interested. – casparjespersen Jun 08 '20 at 20:08
  • Hey @QuickSilver, please go through this link, especially the 5th point: https://datanoon.com/blog/spark_repartition_coalesce/ It explains the concept in a very simple way, and I am open to further discussion. Coalesce has some limitations and hence should be used cautiously. – Shubham Jain Jun 09 '20 at 04:50
  • That is exactly my point: the author wants the number of files reduced, and coalesce is good for that with little or no shuffling. @ShubhamJain – QuickSilver Jun 09 '20 at 04:56
  • In case more than one partition is to be created and we coalesce the df, all the partitions will first be reduced to a single one and then distributed, but in the case of repartition the number of partitions will be equal to the number of distinct user_id values. – Shubham Jain Jun 09 '20 at 04:58
  • He wants one file in each partition... nvm, I was just suggesting. – Shubham Jain Jun 09 '20 at 05:00
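
To make the repartition-vs-coalesce trade-off debated in the comments above concrete, here is a minimal sketch contrasting the two writes. Paths are elided with "..." as in the question; nothing here is taken from either answer beyond the two calls being compared:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.read.json("...")  # input path elided as in the question

# coalesce(1): collapses the whole DataFrame into a single partition,
# so one task writes every user's file sequentially.
sdf.coalesce(1).write.partitionBy("user_id").json("...")

# repartition("user_id"): shuffles rows so that all of a given user's
# rows land in the same partition; the write then runs in parallel
# across partitions and still yields one file per user_id directory.
sdf.repartition("user_id").write.partitionBy("user_id").json("...")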

You can use repartition to ensure that each partition gets exactly one file:

sdf.repartition('user_id').write.partitionBy("user_id").json("...")

This will make sure that one file is created for each partition, but in the case of coalesce, if there is more than one partition, it can cause trouble.
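
Since the question mentions a daily run, here is a minimal sketch of that variant. The run_date column name and the append mode are assumptions for illustration, not part of this answer:

from pyspark.sql import functions as F

# Hypothetical daily job: stamp each run with its date and partition by
# both user_id and run_date, giving one output file per user per day.
(sdf.withColumn("run_date", F.current_date())
    .repartition("user_id")
    .write
    .mode("append")  # keep previous days' output
    .partitionBy("user_id", "run_date")
    .json("..."))  # output path elided as in the question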

Shubham Jain