If we have millions of small text files, ranging in size from a few KB to a few MB, which of HDFS and HBase takes less processing time? And which consumes less memory?
HDFS is not meant for small files. Related: http://stackoverflow.com/questions/13993143/hdfs-performance-for-small-files?rq=1 – OneCricketeer Nov 28 '16 at 09:35
2 Answers
This is a high-level question, and information about the type of data is missing. In general terms, though, here is what to keep in mind when deciding whether to store the data in HDFS or HBase.
Since we have a large quantity of small files, storing them in HDFS has a couple of problems:
- The metadata load on the NameNode will be high, because it keeps an entry for every file and every block in memory.
- If the block size and input split size are not configured properly, the full potential of data locality and parallel processing will not be utilized. For more information on the relation between input split and block size, please refer to Split size vs Block size in Hadoop.
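To get a feel for the NameNode metadata problem, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (one per file, one per block) and a 128 MB block size. Both figures, and the function name, are assumptions for illustration, not measurements from any particular cluster.

```python
import math

BYTES_PER_OBJECT = 150           # assumed rule-of-thumb heap cost per file/block object
BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size (128 MB)

def namenode_heap_bytes(num_files, avg_file_size, block_size=BLOCK_SIZE):
    """Estimate NameNode heap: one object per file plus one per block."""
    # Simplification: every file is counted as occupying at least one block.
    blocks_per_file = max(1, math.ceil(avg_file_size / block_size))
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 10 million files of 100 KB each, stored as-is.
as_is = namenode_heap_bytes(10_000_000, 100 * 1024)

# The same bytes merged into files of roughly one block each.
total_bytes = 10_000_000 * 100 * 1024
merged_files = math.ceil(total_bytes / BLOCK_SIZE)
merged = namenode_heap_bytes(merged_files, BLOCK_SIZE)

print(f"as-is:  {as_is / 1e9:.1f} GB of heap")
print(f"merged: {merged / 1e6:.1f} MB of heap")
```

Under these assumptions the unmerged files cost about 3 GB of NameNode heap for metadata alone, while the same data in block-sized files costs a few MB.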
So, storing them in HDFS is virtually ruled out unless you have a strong reason to do so.
If we do choose to store them in HDFS, can we merge files together so that each merged file is close to the block size? How does this impact performance?
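As an illustration of the merging idea, the sketch below greedily groups small files into batches that stay at or below a target size, so that each batch could then be written out as one larger file. The function name, the `(name, size)` input shape, and the 128 MB default are assumptions for this example. It only plans the grouping and does not touch HDFS.

```python
def pack_into_batches(files, target_bytes=128 * 1024 * 1024):
    """Greedily group (name, size) pairs into batches of at most target_bytes.

    A single file larger than target_bytes ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Close the current batch if adding this file would overflow it.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Tiny example with a target of 100 bytes so the grouping is easy to follow.
files = [("a", 60), ("b", 50), ("c", 30), ("d", 200), ("e", 10)]
print(pack_into_batches(files, target_bytes=100))
# prints [['a'], ['b', 'c'], ['d'], ['e']]
```

This keeps files in their original order, which is simple but not optimal; sorting by size first would usually pack tighter at the cost of losing that order.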
HBase, however, overcomes these problems because it stores data in tables and uses compaction. But before settling on HBase as the storage platform, we need to consider the following points:
- Does the data at hand have a schema suitable for HBase? Does it have a schema at all?
- Can we construct a row key for the data that spreads the rows evenly across the HBase RegionServers?
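On the row-key question, one common approach is to prefix the natural key with a small hash-derived "salt" so that keys sharing a prefix (for example, paths under the same directory) do not all land in one region. The sketch below is only an illustration: the function name, the bucket count of 16, and the use of the file path as the natural key are assumptions, and the right design depends on how the data will be read back.

```python
import hashlib

def salted_row_key(file_path, buckets=16):
    """Build a row key of the form '<salt>|<file_path>'.

    The salt is derived from a hash of the path, so the same path always
    maps to the same bucket and lookups by path remain possible.
    """
    digest = hashlib.md5(file_path.encode("utf-8")).digest()
    salt = digest[0] % buckets          # bucket number in [0, buckets)
    return f"{salt:02d}|{file_path}"

# Paths that share a long common prefix still spread across buckets.
keys = [salted_row_key(f"/data/2016/11/28/file-{i}.txt") for i in range(1000)]
print(len({k[:2] for k in keys}), "distinct salt prefixes in use")
```

The trade-off is that a range scan over the natural key order now has to fan out across all salt buckets.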
If we have answers to all these questions, we can come to a conclusion. I suggest you examine your data along these lines and make a careful decision. This is not a solution, but a direction in which you should think and proceed.
If you have millions of small files varying from KBs to MBs, HDFS with MapReduce jobs is overkill for processing the data.
HBase is one alternative to address this issue, but you have other alternatives such as Hadoop Archive (HAR) files and SequenceFiles.
Refer to these related SE questions: