
I'm using FileInputFormat.addInputPath to specify the input path for my Hadoop job. I've found that if I have x files in my input directory, x mappers are started over the course of the whole job.
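
For context, this is roughly how the job is wired up (a minimal sketch; the driver class name and the argument-based paths are just placeholders, and on older Hadoop 1.x releases `new Job(conf, ...)` would be used instead of `Job.getInstance`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-job");
        job.setJarByClass(MyJobDriver.class);

        // Every file under the input directory contributes at least one input
        // split, so with many small files roughly one mapper is launched per file.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```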

I was wondering if there is any way to specify which input files correspond to which node, so that I can control which machine will operate on a given set of input files.

The reason I'm doing this is that I'm working with a heterogeneous cluster, and I want to balance the workload as evenly as possible.

Olshansky

1 Answer


You can't do that, since it would slow down your job significantly. However, you can increase the locality of your tasks by using the Fair Scheduler (thanks to a technique it uses called "Delay Scheduling"). This page explains the configuration parameters you can modify to achieve higher locality (at the expense of waiting longer for an adequate node); see the locality.threshold.* parameters.
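
For example (an illustrative sketch only: these property names are from the YARN Fair Scheduler; a classic MRv1 cluster configures the Fair Scheduler through mapred-site.xml with different parameter names), something along these lines in yarn-site.xml enables the scheduler and raises the locality thresholds:

```xml
<configuration>
  <!-- Use the Fair Scheduler instead of the default scheduler -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <!-- Scheduling opportunities to pass up (as a fraction of cluster size)
       before accepting a container on a non-local node -->
  <property>
    <name>yarn.scheduler.fair.locality.threshold.node</name>
    <value>0.5</value>
  </property>
  <!-- Same idea, but before giving up on rack-locality -->
  <property>
    <name>yarn.scheduler.fair.locality.threshold.rack</name>
    <value>0.7</value>
  </property>
</configuration>
```

The threshold values are fractions of the cluster: the scheduler passes up roughly that share of scheduling opportunities before it gives up on node-local (or rack-local) placement.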

See this other SO question for more details on the issue of locality in Hadoop. Also, see the "Delay Scheduling" section in the Hadoop Fair Scheduler design document.

cabad
  • Thanks for your answer! However, I was wondering if you have any recommendations as to how I should go about dealing with heterogeneous machines? The difference in memory and CPU between the varying machines is relatively large, so I do want to avoid running the mappers that take a long time on the slower machines. – Olshansky Jan 02 '14 at 23:26
  • @Olshansky You can't avoid them, but speculative execution should take care of your problem. See [this other SO question](http://stackoverflow.com/questions/15164886/hadoop-speculative-task-execution) for more details. – cabad Jan 03 '14 at 15:56
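
To illustrate, here is a minimal sketch of toggling it per job (speculative execution is normally on by default; the underlying property names differ between Hadoop 1.x, `mapred.*.tasks.speculative.execution`, and 2.x, `mapreduce.*.speculative`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "speculative-demo");

        // Let the framework re-launch slow map/reduce attempts on other nodes,
        // so stragglers on the weaker machines don't hold up the whole job.
        job.setMapSpeculativeExecution(true);
        job.setReduceSpeculativeExecution(true);
    }
}
```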