
I have a job that ingests data into S3 on a daily basis, partitioned by a specific field, e.g.:

...
result_df.write.partitionBy("my_field").parquet("s3://my/location/")

This ingestion process writes into already existing partitions every day, adding files that contain only one or a few records. I want to emphasize that this happens every day: over time it generates the many small files that everybody hates. You would probably tell me that this is not the best field for partitioning, but it is the field the business needs.
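For context, a minimal sketch of what the daily write presumably looks like; the explicit append mode is an assumption on my side (the default save mode would fail on an already existing path), and result_df is assumed to hold only that day's records:

# minimal sketch of the daily ingestion, assuming result_df contains
# only the records ingested on that day
(result_df
    .write
    .mode("append")                  # every run adds new files to already existing partitions
    .partitionBy("my_field")
    .parquet("s3://my/location/"))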

So I was thinking of running another job, on a daily basis, that reviews the partitions containing too many files and coalesces them. Unfortunately I can't think of an efficient way to coalesce these files with Spark. The only solution that came to my mind is the following (a rough sketch is shown after the list):

  1. read the partition that contains too many small files
  2. repartition it and write the result to a support folder
  3. delete the source partition
  4. move the data generated in step 2 back to the original partition
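As a rough sketch of those four steps for a single partition value (the paths, the single output file, and the use of the Hadoop FileSystem API for the delete/move are my own assumptions, not necessarily the best way to do it):

# assumes an existing SparkSession named `spark`; paths are hypothetical
partition_path = "s3://my/location/my_field=some_value/"
staging_path = "s3://my/staging/my_field=some_value/"

# 1. read the partition that contains too many small files
df = spark.read.parquet(partition_path)

# 2. repartition and write the result to a support folder
df.coalesce(1).write.mode("overwrite").parquet(staging_path)

# 3. delete the source partition, and
# 4. move the staged data back into its place, e.g. via the Hadoop FileSystem API
#    (on S3 a rename is still a copy behind the scenes, which is exactly the
#    inefficiency complained about below)
Path = spark.sparkContext._jvm.org.apache.hadoop.fs.Path
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = Path(partition_path).getFileSystem(hadoop_conf)
fs.delete(Path(partition_path), True)
fs.rename(Path(staging_path), Path(partition_path))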

I really don't like the idea of moving the data around so many times, and it seems inefficient. Ideally all the files in a partition would be compacted in place into a smaller number of files, but with Spark that doesn't look feasible to me.
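For completeness, the "review" part of that daily job (finding which partitions hold too many files) could be a plain directory listing; a sketch under assumed paths and an arbitrary threshold:

# hypothetical sketch: count files per my_field=... directory and flag the ones
# above some threshold (50 here is arbitrary); assumes a SparkSession `spark`
Path = spark.sparkContext._jvm.org.apache.hadoop.fs.Path
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
root = Path("s3://my/location/")
fs = root.getFileSystem(hadoop_conf)

partitions_to_compact = []
for status in fs.listStatus(root):            # one FileStatus per partition directory
    if status.isDirectory():
        n_files = len(fs.listStatus(status.getPath()))
        if n_files > 50:
            partitions_to_compact.append(str(status.getPath()))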

Are there any best practices regarding this use case? Or any improvement to the suggested process?

    There are numerous questions on the subject of merging parquet files in SO, for example https://stackoverflow.com/questions/38610839/how-to-merge-multiple-parquet-files-to-single-parquet-file-using-linux-or-hdfs-c. Unfortunately, there is no better solution than your own idea. – mazaneicha Apr 15 '20 at 13:14
  • @mazaneicha thanks for your edit and reply. I was hoping something smarter was available :( But the computational cost of this solution discourages me from using this field as a partition column. Still, I don't think I'm the only person who has faced a similar use case, and I guess it might be a common issue; do people really apply this solution when facing it, or do they just avoid such an inconvenient partition field? – Vzzarr Apr 16 '20 at 11:08
  • Compactions are a part of life in Hadoop; everybody does it. The situation has more to do with the "appendability" of the parquet format and the Hadoop file system in general than with your choice of a partition field. If your case results in a really severe small-file problem, maybe it's time to look at alternative storage mechanisms such as HBase or Kudu. – mazaneicha Apr 16 '20 at 12:29
