
I have a large number of input splits (about 50,000) created by small files that I want to process with Hadoop. I have, however, only 256 containers available to process them.

The job itself uses a lot of CPU but comparatively little memory.

I am using Hadoop 2.3 and was looking at the JVM reuse feature from MapReduce 1.0.

I have also read about uber tasks, but it didn't look like the same thing - or I have a different understanding of JVM reuse.

Since I have lots of small files (and am generating an InputSplit per file), I want to run a single JVM per container and execute as many sequential map tasks as possible inside that already-allocated JVM. This would reduce the JVM allocation overhead.

I assume that for each input split a new mapper is allocated, and thus a new JVM - am I right?

How can I do such a thing in YARN?

I know that I could also use compression to increase the InputSplit size, but for this particular application that is not viable.

Best Regards, Marco Lotz

Prometheus

1 Answer


Yes. In YARN, each task runs in a dedicated JVM, and unlike MapReduce 1, YARN does not support JVM reuse.

In MapReduce 1, however, the property controlling task JVM reuse is mapred.job.reuse.jvm.num.tasks. It specifies the maximum number of tasks to run per launched JVM for a given job, and it defaults to 1. This answer should give you a better idea about JVM reuse in MapReduce 1.
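For reference, here is a minimal sketch of how that property would be set in an MRv1 job driver, assuming the old org.apache.hadoop.mapred API and a cluster running classic MapReduce 1 (it has no effect under YARN/MRv2). The class name and input/output paths are just placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JvmReuseDriver.class);
            conf.setJobName("small-files-job");

            // -1 means "no limit": a launched JVM is reused for as many of the
            // job's tasks as the tasktracker schedules on it. A positive value
            // caps the number of tasks per JVM; the default is 1 (no reuse).
            conf.setNumTasksToExecutePerJvm(-1);
            // Equivalent to: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

The same key could also be set cluster-wide in mapred-site.xml, but again, only an MRv1 runtime will honor it.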

Phani Rahul