I have lots (up to hundreds of thousands) of small files, each 10-100 KB. The HDFS block size is 128 MB and the replication factor is 1.

Are there any drawbacks to allocating an HDFS block per small file?

I've seen pretty contradictory answers:

  1. An answer which said that even the smallest file takes up a whole block
  2. An answer which said that HDFS is clever enough, and a small file takes up only small_file_size + 300 bytes of metadata

I ran a test like the one in this answer, and it shows that the 2nd option is correct: HDFS doesn't allocate a whole block for small files.

But what about a batch read of 10,000 small files from HDFS? Will it be slowed down because of the 10,000 blocks and their metadata? Is there any reason to keep multiple small files within a single block?

Update: my use case

I have only one use case for small files, from 1,000 up to 500,000 of them. I compute those files once, store them, and then read them all at once.

1) As I understand it, NameNode space is not a problem for me. 500,000 is an absolute maximum; I will never have more. If each small file takes 150 bytes on the NN, then the absolute maximum for me is 500,000 × 150 bytes ≈ 71.52 MB, which is acceptable.

2) Does Apache Spark eliminate the MapReduce problem? Will sequence files or HAR help me solve the issue? As I understand it, Spark shouldn't depend on Hadoop MR, but it's still too slow: 490 files take 38 seconds to read, 3,420 files take 266 seconds.

Dataset<SmallFileWrapper> smallFiles = sparkSession
    .read()
    .parquet(pathsToSmallFilesCollection)        // read the whole collection of small Parquet files
    .as(Encoders.kryo(SmallFileWrapper.class))   // deserialize each row into the wrapper class
    .coalesce(numPartitions);                    // reduce the number of partitions after the read
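
For comparison, below is a minimal sketch (not part of the measurement above) of how the same data might be read back if the small files had first been packed into a single SequenceFile, keyed by file name with the raw bytes as values. The class name, the app name, the path /data/packed/small-files.seq and the key/value layout are assumptions for illustration only.

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class ReadPackedFiles {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("read-packed-files").getOrCreate();
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // One packed SequenceFile instead of thousands of small files.
        // Assumed layout: key = original file name (Text), value = raw bytes (BytesWritable).
        JavaPairRDD<String, byte[]> contents = jsc
                .sequenceFile("/data/packed/small-files.seq", Text.class, BytesWritable.class)
                .mapToPair(kv -> new Tuple2<>(kv._1().toString(), kv._2().copyBytes()));

        System.out.println("files read: " + contents.count());   // force the actual read
        spark.stop();
    }
}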
VB_
  • Please specify what you mean by **batch read** (Sequence File? HAR? any other aggregation?). I will answer the rest of your questions after you provide more details on the first one. – Serhiy May 09 '17 at 09:39
  • @Serhiy Suppose I have 10k small files, and need to read them all into memory at once. – VB_ May 09 '17 at 19:33

1 Answer

As you have already noticed, an HDFS file does not take any more space than it needs, but there are other drawbacks to having small files in the HDFS cluster. Let's first go through the problems without taking batching into consideration:

  1. NameNode (NN) memory consumption. I am not aware of the details of Hadoop 3 (which is currently under development), but in previous versions the NN is a single point of failure (you can add a secondary NN, but in the end it will not replace or enhance the primary NN). The NN is responsible for maintaining the file-system structure in memory and on disk, and it has limited resources. Each file-system object maintained by the NN is believed to take about 150 bytes (check this blog post). More files = more RAM consumed by the NN.
  2. The MapReduce paradigm (and as far as I know Spark suffers from the same symptoms). In Hadoop, Mappers are allocated per split (which by default corresponds to a block); this means that for every small file you have out there, a new Mapper needs to be started to process its contents. The problem is that for small files it actually takes Hadoop much longer to start the Mapper than to process the file content. Basically, your system will be doing the unnecessary work of starting/stopping Mappers instead of actually processing the data. This is the reason Hadoop processes one 128 MB file (with a 128 MB block size) much faster than 128 one-MB files (with the same block size). A sketch of one common workaround follows this list.
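
To make point 2 concrete, here is a hedged sketch of one common mitigation in plain MapReduce, also mentioned in the comments below: CombineTextInputFormat packs many small files into each split, so far fewer Mappers are started. The class name, the identity map-only setup and the /data paths are placeholders; treat this as a sketch rather than a tuned job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-files-job");
        job.setJarByClass(SmallFilesJob.class);

        // Pack many small files into splits of up to ~128 MB instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        job.setMapperClass(Mapper.class);   // identity mapper, map-only job for the sketch
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/data/small-files"));   // placeholder input dir
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));         // placeholder output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}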

Now, if we talk about batching, you have a few options out there: HAR, Sequence File, Avro schemas, etc. Precise answers to your questions depend on the use case. Let's assume you do not want to merge files; in this case you might use HAR files (or any other solution featuring efficient archiving and indexing). Then the NN problem is solved, but the number of Mappers will still be equal to the number of splits. If merging files into a large one is an option, you can use Sequence File, which basically aggregates small files into bigger ones, solving both problems to some extent. In both scenarios, though, you cannot really update/delete the information directly the way you would be able to do with small files, so more sophisticated mechanisms are required for managing those structures.
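
If merging is acceptable, the packing step itself could look roughly like the sketch below: each small file is appended to one SequenceFile, keyed by its name, so the whole collection later reads as a single large file. The local source directory, the output path and the class name are placeholders, and this is only a sketch of the idea.

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("/data/packed/small-files.seq");   // placeholder output path

        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(output),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));

            // Placeholder local directory holding the small files; assumed to exist.
            for (File f : new File("/data/small-files").listFiles()) {
                byte[] content = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(content));   // key = file name
            }
        } finally {
            IOUtils.closeStream(writer);   // flush and close the SequenceFile
        }
    }
}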

In general, if the main reason for maintaining many small files is an attempt to make reads fast, I would suggest taking a look at different systems like HBase, which were created for fast data access rather than batch processing.

Serhiy
  • thank you for such a full answer! I appreciate it a lot. Could you please take a look at the update section of my question? – VB_ May 10 '17 at 13:27
  • I would propose that you ask another question, since I am not a Spark expert and this question is getting too broad. Just a speculation: as far as I know, small files are also a problem for Spark, unless you write your own custom loader, or **maybe** Sequence file/other file aggregation formats can reduce the file load time (once again, this is only speculation; once again, I am **not** a Spark expert). – Serhiy May 10 '17 at 13:56
  • _"for every small file ... a new Mapper"_ > that's the default, but Hadoop `CombineFileInputFormat` has been specifically created to buffer multiple small splits per Mapper; used in Hive via `hive.hadoop.supports.splittable.combineinputformat` property: _"Whether to combine small input files so that fewer mappers are spawned"_ -- see also `hive.input.format` in https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties – Samson Scharfrichter May 10 '17 at 16:06