
I could have up to 100,000 small files (each 10-50 KB). They are all stored in HDFS with a block size of 128 MB, and I have to read them all at once with Apache Spark, as below:

// return a list of paths to the small files
List<String> paths = getAllPaths();
// read up to 100,000 small files at once into memory
sparkSession
    .read()
    .parquet(paths.toArray(new String[0])) // DataFrameReader.parquet takes String...
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);

Problem

The number of small files is not a problem from the perspective of memory consumption. The problem is the speed of reading that many files: it takes 38 seconds to read 490 small files and 266 seconds to read 3,420 files, so I suppose reading 100,000 files would take far longer.

Questions

Will HAR or sequence files speed up an Apache Spark batch read of 10k-100k small files? Why?

Will HAR or sequence files slow down persisting those small files? Why?

P.S.

Batch read is the only operation required for these small files; I don't need to read them by id or anything else.

  • Hadoop has a standard workaround, commonly used in Hive to read small "streamed" files (cf. my comment in your previous question) and you are not the first Spark-ikaze to stumble on that problem -- cf. http://stackoverflow.com/questions/24623402/apache-spark-on-yarn-large-number-of-input-data-files-combine-multiple-input-f – Samson Scharfrichter May 10 '17 at 16:25

1 Answer


From that post: How does the number of partitions affect `wholeTextFiles` and `textFiles`?

wholeTextFiles uses WholeTextFileInputFormat ... Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition ... Each record in the RDD ... has the entire contents of the file


Confirmation in the Spark 1.6.3 Java API documentation for SparkContext
http://spark.apache.org/docs/1.6.3/api/java/index.html

RDD<scala.Tuple2<String,String>> wholeTextFiles(String path, int minPartitions)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.


Confirmation in the source code (branch 1.6) comments for class WholeTextFileInputFormat
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala

A org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat for reading whole text files. Each file is read as key-value pair, where the key is the file path and the value is the entire content of file.
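For illustration, here's a minimal Java sketch of that call -- the HDFS path and partition count are hypothetical, tune them to your data:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WholeTextFilesSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("small-files-batch-read");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each element is (file path, entire file content); the underlying
        // WholeTextFileInputFormat combines many small files per partition.
        int minPartitions = 16; // hypothetical value, tune to your cluster
        JavaPairRDD<String, String> files =
                sc.wholeTextFiles("hdfs:///data/small-files", minPartitions); // hypothetical path

        System.out.println("partitions: " + files.getNumPartitions());
        sc.stop();
    }
}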


For the record, Hadoop CombineInputFormat is the standard way to stuff multiple small files into a single Mapper; it can be used in Hive with properties hive.hadoop.supports.splittable.combineinputformat and hive.input.format.

Spark wholeTextFiles() reuses that Hadoop feature, with two drawbacks (both illustrated in the sketch below):
(a) you have to consume a whole directory and can't filter out files by name before loading them (you can only filter after loading)
(b) you have to post-process the RDD by splitting each file into multiple records, if required
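In practice, continuing from the `files` pair RDD in the sketch above (and assuming Spark 2.x, where flatMap expects an Iterator; the `.json` name filter and line-based splitting are hypothetical examples):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// "files" is the (path, content) pair RDD from the sketch above.
// (a) filtering by file name only works after the files are already loaded:
JavaPairRDD<String, String> wanted =
        files.filter(pair -> pair._1().endsWith(".json")); // hypothetical filter

// (b) post-process each whole-file value, e.g. into one record per line:
JavaRDD<String> records =
        wanted.flatMap(pair -> Arrays.asList(pair._2().split("\n")).iterator());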

That seems to be a viable solution nonetheless, cf. that post: Spark partitioning/cluster enforcing


Or, you can build your own custom file reader based on that same Hadoop CombineInputFormat, cf. that post: Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)
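A minimal sketch of that approach with Hadoop's stock CombineTextInputFormat -- the path and the 128 MB split cap are hypothetical values, and "sc" is the JavaSparkContext from the first sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;

Configuration hadoopConf = new Configuration();
// Cap each combined split so that many small files share a single task;
// 128 MB here is a hypothetical value matching the HDFS block size.
hadoopConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024);

JavaPairRDD<LongWritable, Text> combined = sc.newAPIHadoopFile(
        "hdfs:///data/small-files",      // hypothetical path
        CombineTextInputFormat.class,
        LongWritable.class,
        Text.class,
        hadoopConf);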
  • thank you for the answer! I've already implemented that stuff through sequence files `saveAsHadoopFile(path, String.class, SmallFileWrapper.class, SequenceFileOutputFormat.class)`. Could you pls compare sequence files with `wholeTextFiles` solutions? Which is better and when? – VB_ May 11 '17 at 11:06
  • Duh... IMHO `wholeTextFiles` is a stop-gap solution, when you have a swarm of small text files that you need to read efficiently in one pass (then reprocess in memory). Now, if you want to consolidate the data first (e.g. Write-Once-Read-Many times scenario) all the options are open -- file format, compression, etc. It depends. – Samson Scharfrichter May 11 '17 at 16:18