I am looking for a solution (in Hadoop 2.2+) to the following problem statement.

Problem Statement:

We need to process 3 million+ files on a daily basis. We want to capture the file name as well as the data in each file. How do I process this data in the most efficient way?

I am aware of "CombineFileInputFormat", "MultiFileInputSplit", and the "HAR" (Hadoop Archive) file layout, but I am not sure which one would be better in terms of performance.
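For reference, here is a rough sketch of the CombineFileInputFormat route I am considering, using the built-in CombineTextInputFormat. The paths and the split size are placeholders, and it relies on the default (identity) mapper with no reducers, so the real job would of course plug in its own mapper logic:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFilesDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            job.setJarByClass(SmallFilesDriver.class);

            // Pack many small files into fewer, larger splits so that one mapper
            // handles a batch of files instead of a single tiny file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // ~128 MB per split

            // Placeholder paths; the default identity mapper and zero reducers
            // simply copy the combined input through to the output directory.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setNumReduceTasks(0);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }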

If there are any other better options that you are aware of, please share.

Opal

1 Answer

What do you mean by "process" or "capture"? Since those terms could mean almost anything in the context provided, I will intentionally assume that deleting the files will meet your requirements (even though it probably doesn't), so that I can make a point about what can happen when insufficient information is provided.

On that assumption, the most efficient way to process all of your data files would be to delete all of them. That would "capture" all of your files, including the file names and the data. Using Hadoop, that would be:

hadoop fs -rm -r /PATH/TO/FILES/TO/DELETE

However, depending on where and how the files are being stored, you may need to use a different method to delete the files, such as:

TRUNCATE TABLE [tableName]

(if you're using a SQL database)

or:

rm -rf /path/to/files

(if it's on a local Linux filesystem)

If this answer does not solve your current problem, then please be more specific about what you are trying to do because your question is ambiguous. Welcome to Stack Overflow. We want to help, but we cannot read your mind.

Here are things that need to be clarified:

What is meant by "process"? The word "process" could mean anything. Trying to concatenate files? Concatenate certain files based on certain rules? Compute an aggregation? Filter out certain data? Join data? Perform a combination of these operations? Is deduplication or validation on the files necessary? And is the operation a batch or streaming operation? If you're considering using Hadoop, I hope you're not dealing with a streaming operation.

What are the file types, and what is the data? Are they text files? Binary files? Parquet files? XML? JSON? CSV? Are they encrypted? Could they contain garbage data? What if they're all just symlinks? We can't know how to "process" the files more specifically than doing something generic like compression or deletion if it's not clear what the files/data consist of. You mention the "HAR" (Hadoop Archive) layout, but a HAR file only packs many small files into fewer HDFS objects; you're asking whether that would be the correct layout to use without giving any detail or examples of what the data are or what needs to be done with them, and there is no file format that solves every possible problem in the most efficient way. (Otherwise, there would only be one file format in existence that anybody would ever use.)

What is meant by "capture"? Does the data need to be saved into a database? A SQL database? HBase? A NoSQL database like DynamoDB? Does the "captured" data need to be mapped into another file? Do the files need to be transformed into a structured format like JSON? Does the operation need to output specific data like parquet files? Depending on the memory requirements of your operation, you could potentially get 100x speedup by using Spark or PySpark instead of Hadoop. But we need more information to make a recommendation like that with more precision. Be sure to use the right technology for the right purpose.

What is meant by "efficient"? Does "efficient" mean algorithmic runtime? That depends on the actual process that needs to be computed. Or does "efficient" mean memory or storage? Again, this is unclear.

Also, in the future, we need more context about the details. For example, if you mention a specific version of Hadoop, we need to know how or why that version is relevant. For all we know, Hadoop might be a totally inappropriate tool for processing that many files; Spark or Flink, for example, might be more appropriate. Or maybe Elasticsearch. Or maybe a graph technology. Or maybe Amazon Kinesis with a Lambda. We need more information to give specific recommendations.

There are additional guidelines on how to ask an effective Stack Overflow question here: https://stackoverflow.com/help/how-to-ask. I'm sorry if this answer seems harsh, but I recommend that you accept that you need to provide more detail and write a new question that will get you a fresh set of eyes.

devinbost