I have a large number (up to hundreds of thousands) of small files, each 10-100 KB. My HDFS block size is 128 MB and my replication factor is 1.
Are there any drawbacks to allocating an HDFS block per small file?
I've seen pretty contradictory answers:
- An answer saying that even the smallest file takes up a whole block
- An answer saying that HDFS is clever enough, and a small file will take up only small_file_size + 300 bytes of metadata
I ran a test like the one in this answer, and it confirms that the second option is correct: HDFS does not allocate a whole block for small files.
But what about a batch read of 10,000 small files from HDFS? Will it be slowed down by the 10,000 blocks and their metadata? Is there any reason to keep multiple small files within a single block?
Update: my use case
I have only one use case for small files: somewhere between 1,000 and 500,000 of them. I compute the files once, store them, and then read them all at once.
1) As I understand it, NameNode memory is not a problem for me. 500,000 files is the absolute maximum; I will never have more. If each small file takes about 150 bytes on the NameNode, then the absolute maximum for me is 500,000 × 150 bytes ≈ 71.52 MB, which is acceptable.
2) Does Apache Spark eliminate the MapReduce small-files problem? Would sequence files or HAR help me solve the issue? As I understand it, Spark doesn't depend on Hadoop MapReduce, but it is still too slow: 490 files take 38 seconds to read, and 3,420 files take 266 seconds.
sparkSession
    .read()
    .parquet(pathsToSmallFilesCollection)
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);
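
For reference, one direction I am considering (just a sketch, not something I have benchmarked; compactedPath and numOutputFiles are placeholder names) is a one-off compaction pass that rewrites the small Parquet files into a few large ones, so the batch read only has to open a handful of files:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// One-off compaction: read every small Parquet file and rewrite the same data
// as numOutputFiles larger Parquet files under compactedPath (both are placeholders).
Dataset<Row> allSmallFiles = sparkSession
    .read()
    .parquet(pathsToSmallFilesCollection);

allSmallFiles
    .coalesce(numOutputFiles)        // e.g. a few dozen output files instead of hundreds of thousands
    .write()
    .mode(SaveMode.Overwrite)
    .parquet(compactedPath);         // later reads open numOutputFiles files, not 500,000

The read snippet above would then point at compactedPath instead of pathsToSmallFilesCollection. Is that a reasonable way to avoid the per-file overhead, or are sequence files / HAR still the better fit here?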