
I want to share large in-memory static data (a RAM Lucene index) across my map tasks in Hadoop. Is there a way for several map/reduce tasks to share the same JVM?

yura

4 Answers


Jobs can enable task JVMs to be reused by setting the job configuration property mapred.job.reuse.jvm.num.tasks. If the value is 1 (the default), JVMs are not reused (i.e. one task per JVM). If it is -1, there is no limit to the number of tasks (of the same job) a JVM can run. You can also specify a value greater than 1 via the API.
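
The same setting can also be applied per job in code. A minimal sketch using the old org.apache.hadoop.mapred API (the driver class here is just a placeholder):

import org.apache.hadoop.mapred.JobConf;

public class ReuseDriver {
    public static void main(String[] args) {
        JobConf conf = new JobConf(ReuseDriver.class);
        // -1 removes the limit on how many tasks of this job a single JVM may run
        conf.setNumTasksToExecutePerJvm(-1);
        // equivalent to setting the raw property:
        // conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
        // ... configure mapper/reducer, input/output paths, then submit via JobClient
    }
}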

Joe Stein
  • Thanks, one more question. Do those tasks also share a classloader, so that all static resources are loaded only once? (Or does it work like Tomcat, in which case there is almost no reason to share the JVM...) – yura Feb 04 '11 at 13:10
  • The JVM is cleared after a task completes. This parameter only improves runtime for jobs whose tasks are not long-running, since JVM instantiation is very expensive. You cannot share any resources across task instances. – Thomas Jungblut Feb 04 '11 at 21:04

In $HADOOP_HOME/conf/mapred-site.xml, add the following property:

<property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>#</value>
</property>

The # can be set to a number specifying how many times the JVM is to be reused (the default is 1), or to -1 for no limit on reuse.

QuinnG

To the best of my knowledge, there is no easy way for multiple map tasks (in Hadoop) to share static data structures.

This is actually a known limitation of the current MapReduce model. The reason the current implementation doesn't share static data across map tasks is that Hadoop is designed to be highly reliable: if a task fails, it crashes only its own JVM and does not impact the execution of other JVMs.

I am currently working on a prototype that distributes the work of a single JVM across multiple cores (essentially, you need just one JVM to utilize all the cores). This way, you can reduce the duplication of in-memory data structures without sacrificing CPU utilization. The next step for me is to develop a version of Hadoop that can run multiple map tasks within one JVM, which is exactly what you are asking for.

There is an interesting discussion of this here: https://issues.apache.org/jira/browse/MAPREDUCE-2123

Yunming Zhang

Shameless plug

I go over using static objects with JVM reuse to accomplish what you describe here: http://chasebradford.wordpress.com/2011/02/05/distributed-cache-static-objects-and-fast-setup/
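
Roughly, the pattern is a lazily initialized static field in the mapper. A sketch with the old mapred API (loadIndex is a hypothetical loader for the in-memory index):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SharedIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    // Static field: with mapred.job.reuse.jvm.num.tasks > 1, later tasks of the
    // same job run in this JVM and see the already-loaded object. Tasks in a
    // reused JVM run sequentially, so no synchronization is needed.
    private static Object index;  // stands in for the in-memory Lucene index

    @Override
    public void configure(JobConf job) {
        if (index == null) {         // the expensive load happens once per JVM
            index = loadIndex(job);  // hypothetical loader
        }
    }

    private static Object loadIndex(JobConf job) {
        // ... read the index (e.g. from the distributed cache) into RAM ...
        return new Object();
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // ... query the shared index here ...
        out.collect(new Text("hit"), value);
    }
}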

Another option, although more complicated, is to use the distributed cache with a read-only memory-mapped file. That way you can share the resource across JVM processes as well.
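
A minimal sketch of that approach, assuming the file has already been localized via the distributed cache (the path below is a placeholder). Because the mapping is read-only and backed by the OS page cache, separate task JVMs mapping the same file share one physical copy:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapShare {
    public static MappedByteBuffer mapReadOnly(String localPath) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(localPath, "r");
        try {
            FileChannel ch = raf.getChannel();
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        } finally {
            raf.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; in a real task this would come from the distributed cache.
        MappedByteBuffer buf = mapReadOnly("/tmp/shared-index.bin");
        System.out.println("mapped " + buf.capacity() + " bytes");
    }
}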

Chase