I have a large number of input splits (about 50,000), created by small files, that I want to process with Hadoop. However, I only have 256 containers available to process them.
The job itself is CPU-heavy but uses relatively little memory.
I am using Hadoop 2.3 and have been looking at the JVM reuse feature from MapReduce 1.0.
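For reference, this is the MRv1 setting I mean; as far as I can tell it is simply ignored under YARN (snippet from the job driver):

```java
// MRv1 JVM reuse: -1 meant "reuse the task JVM for an unlimited number
// of sequential tasks of the same job". Ignored under YARN/MRv2.
conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
```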
I have also read about uber tasks, but it doesn't look like the same thing - or perhaps I have misunderstood JVM reuse.
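What I understand by uber tasks is the configuration below (property names from the Hadoop 2 documentation): the whole job runs sequentially inside the ApplicationMaster's single JVM, and only if the job is small enough. With ~50,000 map tasks mine is far above the default thresholds, so this does not seem to help:

```java
// Uber mode: run all tasks of a sufficiently small job sequentially
// inside the ApplicationMaster's JVM instead of requesting containers.
conf.setBoolean("mapreduce.job.ubertask.enable", true);
conf.setInt("mapreduce.job.ubertask.maxmaps", 9);    // default limit
conf.setInt("mapreduce.job.ubertask.maxreduces", 1); // default limit
```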
Since I have lots of small files (and am generating one InputSplit per file), I want to run a single JVM per container and execute as many sequential map tasks as possible in each already-allocated JVM. This would reduce the JVM startup overhead.
I assume that a new mapper is allocated for each input split, and thus a new JVM - am I right?
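To make the setup concrete, here is a minimal driver sketch (MyDriver, MyMapper, and the paths are placeholders, not my real code). With the default FileInputFormat behaviour, every file smaller than a block becomes its own split, hence roughly 50,000 map tasks:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-files-job");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMapper.class); // placeholder mapper class
        // One split per small file -> one map task, one container,
        // and (apparently) one fresh JVM per file.
        FileInputFormat.addInputPath(job, new Path("/data/small-files"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```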
How can I achieve this kind of JVM reuse in YARN?
I know I could also use compression to increase the InputSplit size, but for this particular application that is not viable.