I have 10 million small XML files (300KB-500KB). I'm using Mahout's XML input format in MapReduce to read the data and a SAX parser to parse it, but processing is very slow. Will compressing the input files (LZO) improve performance? Each folder contains 80-90k XML files, and when I start the job it runs one mapper per file. Is there any way to reduce the number of mappers?
-
Are your expectations realistic? How fast do you think it is possible to process 5Tb of data? How slow is very slow? – Michael Kay Sep 17 '15 at 00:12
-
I have a 16-node cluster. How much time do you think it should ideally take? It's taking hours (4-5 hours) to parse 2-3 GB of data. – Aryan087 Sep 17 '15 at 11:45
-
I think 1GB/minute is a reasonable starting point for a typical desktop machine. You should be able to get a lot better than that if you can exploit the concurrency. If it is taking hours to process 2-3GB then something is seriously wrong, and we should focus on finding out exactly what you are doing and where the time is going. – Michael Kay Sep 17 '15 at 13:04
2 Answers
Hadoop doesn't work very well with a huge number of small files; it was designed to deal with a few very big files.
Compressing your files won't help, because, as you have noticed, the problem is that your job has to instantiate a lot of containers to execute the map tasks (one per file). Instantiating the containers can take more time than processing the input itself, and it consumes a lot of resources (memory and CPU).
I'm not familiar with Mahout's input formats, but in Hadoop there is a class that mitigates this problem by combining several inputs into one mapper: CombineTextInputFormat. To work with XML you may need to create your own XMLInputFormat extending CombineFileInputFormat.
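A minimal sketch of such a combining input format, assuming a hypothetical WholeXmlRecordReader that emits one record per XML file (it must have the (CombineFileSplit, TaskAttemptContext, Integer) constructor that CombineFileRecordReader expects):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

    // Packs many small XML files into each split so one map task handles many files.
    public class CombineXmlInputFormat extends CombineFileInputFormat<Text, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Each XML file is small; never split an individual file.
            return false;
        }

        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split,
                TaskAttemptContext context) throws IOException {
            // CombineFileRecordReader iterates over the files in the combined split
            // and delegates each one to WholeXmlRecordReader (hypothetical: key =
            // file name, value = file contents).
            return new CombineFileRecordReader<Text, Text>(
                    (CombineFileSplit) split, context, WholeXmlRecordReader.class);
        }
    }

In the driver, capping mapreduce.input.fileinputformat.split.maxsize (e.g. at 128 MB) controls how many small files end up in each map task.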
Another alternative, although with less improvement, could be to reuse the JVM among the containers: reuse JVM in Hadoop mapreduce jobs
Reusing the JVM saves the time needed to start a JVM for each task, but you still need to create one container per file.
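As a rough sketch (assuming the classic MR1 property mapred.job.reuse.jvm.num.tasks; JVM reuse is not honored under YARN/MR2, where uber mode is the closest analogue), the driver could enable it like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JvmReuseDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // -1 = reuse the JVM for an unlimited number of tasks (MR1 only;
            // ignored by YARN/MR2).
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
            Job job = Job.getInstance(conf, "xml-parsing");
            // ... set input format, mapper, input/output paths as usual ...
        }
    }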
You can follow one of the three approaches as quoted in this article:
- Hadoop Archive File (HAR)
- Sequence Files
- HBase
I have found article 1 and article 2, which list multiple solutions (I have removed some non-generic alternatives from these articles):
- Change the ingestion process/interval: Change the logic at the source level so that it generates a small number of big files instead of a large number of small files
- Batch file consolidation: When small files are unavoidable, file consolidation is the most common solution. With this option you periodically run a simple consolidating MapReduce job that reads all of the small files in a folder and rewrites them into fewer, larger files
- Sequence files: When there is a requirement to maintain the original filename, a very common approach is to use sequence files. In this solution, the filename is stored as the key in the sequence file and the file contents are stored as the value (see the sketch after this list)
- HBase: Instead of writing the files to disk, write them to the HBase MemStore.
- Using a CombineFileInputFormat: The CombineFileInputFormat is an abstract class provided by Hadoop that merges small files at MapReduce read time. The merged files are not persisted to disk. Instead, the process reads multiple files and merges them “on the fly” for consumption by a single map task.
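For the sequence-file route, a minimal consolidation sketch (the paths /data/smallXmlDir and /data/consolidated.seq are placeholders; a real job would also need error handling and probably a recursive directory walk rather than a flat listing):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class XmlToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inputDir = new Path("/data/smallXmlDir");    // placeholder
            Path output = new Path("/data/consolidated.seq"); // placeholder

            // Key = original file name, value = raw XML bytes.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(output),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    byte[] content = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, content, 0, content.length);
                    }
                    writer.append(new Text(status.getPath().getName()),
                            new BytesWritable(content));
                }
            }
        }
    }

The parsing job can then read the consolidated file with SequenceFileInputFormat, so each map task processes many XML documents instead of one.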
