
I have a use case where we have 800,000 JSON files of 2 KB each. Our requirement is to merge these smaller files into a single large file. We have tried achieving this in Spark using repartition and coalesce; however, we are not finding this efficient, as it is consuming more time than expected. Is there any alternative way to achieve the same result in a performant manner?
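For reference, the Spark approach we tried looks roughly like this (the paths are illustrative placeholders, not our actual locations):

```scala
// Roughly what we tried: read all the small JSON files and force a single output file.
val df = spark.read.json("hdfs:///data/small-json/*.json")

df.coalesce(1)                        // collapse to one partition => one output file
  .write
  .mode("overwrite")
  .json("hdfs:///data/merged-json")   // Spark still writes a part-00000 file inside this directory
```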

Appreciate your help. Thanks in advance.

Everything you tagged is a valid option, so what else did you try? Note: if your files are 2 KB each, Hadoop is not what you should be using; you only have about 1.5 GB with that calculation. **At the very least**, compress them all into a Bzip2 file before placing them on HDFS. – OneCricketeer Feb 28 '18 at 22:07

1 Answer


Hadoop isn't the right tool for your case. I would suggest just writing a small Java or Scala program that reads these files one by one and appends them to a single output file. Any Hadoop-related tool will give you a huge performance overhead (initialisation of Pig, for example, takes approximately 30 seconds), while a standalone app will deal with these 800k files in 1-2 minutes or even less.
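A minimal sketch of such a program in Scala, assuming the files sit in a local directory and the merged output should be newline-delimited JSON (the paths and output format are my assumptions, adjust them to your setup):

```scala
import java.io.{BufferedWriter, FileWriter}
import java.nio.file.{Files, Paths}

object MergeJsonFiles {
  def main(args: Array[String]): Unit = {
    // Hypothetical locations -- replace with the real input directory and output file
    val inputDir   = Paths.get("/data/json")
    val outputFile = "/data/merged.json"

    // 1 MB write buffer so we don't hit the disk once per 2 KB file
    val writer = new BufferedWriter(new FileWriter(outputFile), 1 << 20)
    try {
      // Stream the directory listing so all 800k entries are never held in memory at once
      val dirStream = Files.newDirectoryStream(inputDir, "*.json")
      try {
        val it = dirStream.iterator()
        while (it.hasNext) {
          val path = it.next()
          // Each source file is only ~2 KB, so reading it whole is cheap
          val json = new String(Files.readAllBytes(path), "UTF-8").trim
          writer.write(json)
          writer.newLine() // one JSON document per line in the merged file
        }
      } finally dirStream.close()
    } finally writer.close()
  }
}
```

Whether each line should remain a separate JSON document or the result should be wrapped in a single JSON array depends on how the merged file will be consumed downstream.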

Alexey