Can map and reduce jobs be on different machines?

Question

I'm working on a rather unusual solution for computational offloading. I can do it well with custom programming in C++/Java, but I'm looking into whether the same can be done in Hadoop or any other framework. I searched a lot but found nothing useful.

As we know, a normal Hadoop job consists of a map phase and a reduce phase, both running on machines of roughly equal power. The map phase doesn't need much power and could be offloaded to cheap commodity hardware such as a Raspberry Pi, while the reduce phase should run on a strong machine.

So is it possible to isolate these two phases and make them machine-aware?

I'm not sure whether you can configure Hadoop to always run map and reduce on different hosts, but think about data locality, which is the main reason to run both stages on the same host — Iłya Bursov, Oct 15 '15 at 20:49
Data locality is also virtual in today's world, don't you think? Consider that I have mounted a big 1 TB HDD on the RPi. Fairly possible. — Amey Jadiye, Oct 15 '15 at 21:13
Data locality is Hadoop's main feature: each map/reduce job works with a small piece of data, and it's better to have it on a local HDFS partition — Iłya Bursov, Oct 16 '15 at 02:06
What I'm alluding to is that local data is also virtualised and mounted over NFS nowadays, so I can attach a big HDD to a small RPi and run a map job on that. — Amey Jadiye, Oct 16 '15 at 06:09

score 1 · Accepted Answer · answered Oct 15 '15 at 21:02

On each node you can create a mapred-site.xml file to override any default settings. These settings will then only apply to this node (task tracker).

For each node you can then specify values for

mapreduce.tasktracker.reduce.tasks.maximum
mapreduce.tasktracker.map.tasks.maximum

On nodes where you only want to run reduce tasks, set the maximum number of map tasks to 0, and vice versa.
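As a rough sketch of such a per-node override, a map-only node (for example the Raspberry Pi from the question) might carry a mapred-site.xml like the one below. The property names are the two listed above; the value 2 is only an illustrative assumption, and the whole approach assumes the classic TaskTracker (MRv1) slot model. As far as I know, YARN in Hadoop 2.x does not use fixed map/reduce slots, so these settings would not have the same effect there.

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml on a map-only node (illustrative values, MRv1 assumed) -->
<configuration>
  <property>
    <!-- allow up to 2 concurrent map tasks on this node (assumed value) -->
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <!-- no reduce slots, so no reduce task is scheduled here -->
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>0</value>
  </property>
</configuration>
```

A reduce-only node would do the opposite: set the map maximum to 0 and the reduce maximum to a non-zero value.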

Here is the list of configuration options

score 0 · Answer 2 · edited May 23 '17 at 12:23

Reduce tasks can run on a different node, but what is the advantage of running them on a more powerful machine?

You can use the same commodity hardware configuration for both map and reduce nodes.

Fine-tuning a MapReduce job is the trickier part, and it depends on

1) Your input size

2) The time taken by the mappers to complete the map phase

3) The number of map and reduce tasks you set

etc.

Apart from the config changes suggested by Gerhard, have a look at some of these tips for fine-tuning job performance

Tips to tune the number of map and reduce tasks appropriately

Diagnostics/symptoms:

1) Each map or reduce task finishes in less than 30-40 seconds.

2) A large job does not utilize all available slots in the cluster.

3) After most mappers or reducers are scheduled, one or two remain pending and then run all alone.

Tuning the number of map and reduce tasks for a job is important. Some tips.

1) If each task takes less than 30-40 seconds, reduce the number of tasks.

2) If a job has more than 1TB of input, consider increasing the block size of the input dataset to 256M or even 512M so that the number of tasks will be smaller.

3) So long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.

4) Don't schedule too many reduce tasks. For most jobs, the number of reduce tasks should be equal to or a bit less than the number of reduce slots in the cluster.
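The arithmetic behind tips 2 and 4 can be sketched in a few lines of Java. This is only an illustration, not Hadoop code: the class and method names are made up, it assumes roughly one map task per input block, the 128 MB baseline block size is an assumption, and the 95% factor is just one way to express "a bit less than the number of reduce slots".

```java
// Hypothetical helper, not part of Hadoop: back-of-the-envelope task counts.
class TaskCountSketch {

    // Roughly one map task per input block (ceiling division).
    static long estimateMapTasks(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    // "Equal to or a bit less than" the reduce slots: here 95%, at least 1.
    static int suggestReduceTasks(int reduceSlots) {
        return Math.max(1, reduceSlots * 95 / 100);
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        long tb = 1L << 40;
        // 1 TB of input at different block sizes
        System.out.println(estimateMapTasks(tb, 128 * mb)); // 8192
        System.out.println(estimateMapTasks(tb, 256 * mb)); // 4096
        System.out.println(estimateMapTasks(tb, 512 * mb)); // 2048
        // 20 reduce slots in the cluster
        System.out.println(suggestReduceTasks(20));         // 19
    }
}
```

Doubling the block size halves the estimated number of map tasks, which is why a larger block size helps with very large inputs.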

If you still want to have different configuration, have a look at this question and Wiki link

EDIT:

Configure mapred.map.tasks in 1.x (or mapreduce.job.maps in 2.x) and mapred.reduce.tasks in 1.x (or mapreduce.job.reduces in 2.x) according to your hardware configuration, with more reducers on the better hardware nodes. Before configuring these parameters, make sure you have accounted for the input size, map processing time, etc. Note that these properties set the number of tasks for a job; the per-node limits are the tasktracker maximums from the accepted answer.

As said in the question, map jobs don't need much power while reduce needs much more CPU power. The question is all about how we can reduce the cost of the cluster by offloading the simple tasks to cheap hardware. Got it? — Amey Jadiye, Oct 16 '15 at 06:18
Configure mapred.map.tasks & mapred.reduce.tasks accordingly in your VM nodes depending on hardware configuration. Configure more reducers in better hardware nodes. — Ravindra babu, Oct 16 '15 at 09:11
Configure mapred.map.tasks in 1.x (or mapreduce.job.maps in 2.x version) & mapred.reduce.tasks in 1.x (or mapreduce.job.reduces in 2.x version) accordingly in your nodes depending on hardware configuration — Ravindra babu, Oct 16 '15 at 09:20
