
I'm using FileInputFormat.addInputPath to specify the input path for my Hadoop job. I've found that if I have x files in my input directory, x mappers are started over the course of the whole job.
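
For context, this is roughly how the job is wired up (a minimal sketch; the driver class name and the argument-based paths are just placeholders, and on older Hadoop 1.x releases `new Job(conf, ...)` would be used instead of `Job.getInstance`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-job");
        job.setJarByClass(MyJobDriver.class);

        // Every file under the input directory contributes at least one input
        // split, so with many small files roughly one mapper is launched per file.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```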

I was wondering if there is any way to specify which input files correspond to which node, so that I can control which machine will operate on a given set of input files.

The reason I'm doing this is that I'm working with a heterogeneous cluster, and I want to balance the workload as evenly as possible.

Olshansky

1 Answer


You can't do that, since it would slow down your job significantly. However, you can increase the locality of your tasks by using the Fair Scheduler (thanks to a technique it uses called "Delay Scheduling"). This page explains the configuration parameters you can modify to achieve higher locality (at the expense of waiting longer for an adequate node); see the locality.threshold.* parameters.
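
For example (an illustrative sketch only: these property names are from the YARN Fair Scheduler; a classic MRv1 cluster configures the Fair Scheduler through mapred-site.xml with different parameter names), something along these lines in yarn-site.xml enables the scheduler and raises the locality thresholds:

```xml
<configuration>
  <!-- Use the Fair Scheduler instead of the default scheduler -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <!-- Scheduling opportunities to pass up (as a fraction of cluster size)
       before accepting a container on a non-local node -->
  <property>
    <name>yarn.scheduler.fair.locality.threshold.node</name>
    <value>0.5</value>
  </property>
  <!-- Same idea, but before giving up on rack-locality -->
  <property>
    <name>yarn.scheduler.fair.locality.threshold.rack</name>
    <value>0.7</value>
  </property>
</configuration>
```

The threshold values are fractions of the cluster: the scheduler passes up roughly that share of scheduling opportunities before it gives up on node-local (or rack-local) placement.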

See this other SO question for more details on the issue of locality in Hadoop. Also, see the "Delay Scheduling" section in the Hadoop Fair Scheduler design document.

cabad
  • Thanks for your answer! However, I was wondering if you have any recommendations as to how I should go about dealing with heterogeneous machines? The difference in memory and CPU between the varying machines is relatively large, so I do want to avoid running the mappers that take a long time on the slower machines. – Olshansky Jan 02 '14 at 23:26
  • @Olshansky You can't avoid them, but speculative execution should take care of your problem. See [this other SO question](http://stackoverflow.com/questions/15164886/hadoop-speculative-task-execution) for more details. – cabad Jan 03 '14 at 15:56
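
To illustrate, here is a minimal sketch of toggling it per job (speculative execution is normally on by default; the underlying property names differ between Hadoop 1.x, `mapred.*.tasks.speculative.execution`, and 2.x, `mapreduce.*.speculative`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "speculative-demo");

        // Let the framework re-launch slow map/reduce attempts on other nodes,
        // so stragglers on the weaker machines don't hold up the whole job.
        job.setMapSpeculativeExecution(true);
        job.setReduceSpeculativeExecution(true);
    }
}
```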