I have many small Parquet files in a given HDFS location (the count keeps growing within a given month, since we receive two or more files per day). When I read these files in Spark 2.1, the time taken is high, and it increases further as more small files are added to the location.
Since the files are small, I do not want to partition any further in HDFS.
Partitions are created as directories on HDFS, and the files are then placed in those directories. The file format is Parquet.
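For context, here is a simplified version of how I read the data today (the paths and partition names are illustrative, not my actual ones):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ReadSmallParquetFiles")
      .getOrCreate()

    // Reading the partition directory picks up every small file under it.
    // Listing the files and reading each Parquet footer scales with the
    // file count, which is where the slowdown shows up.
    val df = spark.read.parquet("hdfs:///data/events/year=2018/month=01")
    println(df.count())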
Is there another format or process for reading all the small files at once, so that I can reduce the overall read time?
Note: 1) Writing a program that merges all the small files into a single file would add processing overhead to the overall SLA of my process, so I would keep that as a last option (a rough sketch of what I mean is below).
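If I did end up going that route, the compaction would look roughly like this, reusing the spark session from above; the coalesce count and output path are assumptions, not settled choices:

    // Rewrite the month's small files as a few larger ones. Writing to a
    // separate output path avoids clobbering the source directory while
    // it is still being read.
    val merged = spark.read
      .parquet("hdfs:///data/events/year=2018/month=01")
      .coalesce(4) // arbitrary; aim for files near the HDFS block size

    merged.write
      .mode("overwrite")
      .parquet("hdfs:///data/events_compacted/year=2018/month=01")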