
I am running two WordCount jobs on the same cluster (Hadoop 2.6.5, which I run locally as a multi-node cluster); my code runs the two jobs one after the other. Both jobs share the same mapper, reducer, and so on, but each one has a different Partitioner (a sketch of the setup is below).
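
For illustration, a minimal sketch of the setup (the class names here are hypothetical stand-ins, not my actual code):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical stand-in for the first job's partitioner:
    // plain hash partitioning over the word keys.
    public class PartitionerA extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // The two jobs are configured identically except for this line
    // (PartitionerB is analogous but assigns keys to partitions differently):
    // job1.setPartitionerClass(PartitionerA.class);
    // job2.setPartitionerClass(PartitionerB.class);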

Why is there a different allocation of the reduce tasks to the nodes in the second job? I identify the node running each reduce task by the node's IP address (obtained in Java; see the sketch below). I know that the keys will go to different reduce tasks because the Partitioner changed, but I want each partition's destination node to stay unchanged.
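
For reference, this is roughly how I read the IP inside the reducer (a minimal sketch using the standard java.net.InetAddress API; the logging is illustrative):

    import java.io.IOException;
    import java.net.InetAddress;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Report which physical node this reduce task landed on.
            String nodeIp = InetAddress.getLocalHost().getHostAddress();
            System.err.println("Reduce task running on node: " + nodeIp);
        }
        // reduce(...) is the usual word-count sum and is omitted here.
    }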

For example, I have five different keys and four reduce tasks. The allocation for Job 1 is:

  1. partition_1 -> NODE_1
  2. partition_2 -> NODE_1
  3. partition_3 -> NODE_2
  4. partition_4 -> NODE_3

The allocation for Job 2 is:

  1. partition_1 -> NODE_2
  2. partition_2 -> NODE_3
  3. partition_3 -> NODE_1
  4. partition_4 -> NODE_3

1 Answer


In Hadoop there is no locality for reducers, so YARN selects the nodes for the reduce tasks based on available resources. There is no way to force Hadoop to run each reducer on the same node across two jobs.
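
To illustrate the point (this is my reading of what the MapReduce ApplicationMaster does internally, not code you would write yourself): reduce-container requests are sent to the ResourceManager with no host or rack preference, so at the YARN API level they look roughly like this:

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.api.records.ResourceRequest;

    public class ReduceRequestSketch {
        public static ResourceRequest reduceContainerRequest() {
            // Illustrative container size; real values come from the job config.
            Resource capability = Resource.newInstance(1024, 1); // MB, vcores
            return ResourceRequest.newInstance(
                    Priority.newInstance(10),  // reduce priority (map requests use a different one)
                    ResourceRequest.ANY,       // "*": any host, any rack -- no locality constraint
                    capability,
                    1);                        // number of containers
        }
    }

Because the resource name is ResourceRequest.ANY, the scheduler is free to place the container on whichever node has capacity at that moment, which is why the placement can differ between two otherwise identical jobs.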

– Rahim Dastar
  • As far as I know, the AM negotiates with the RM about the container locations and selects them based on the memory, CPU, and disk requests (if given; I still don't know what the defaults are) and on locality (which doesn't play a role in my example, because all the nodes are on the default rack). When you say "there is no way to force Hadoop to run each reducer on the same node", do you mean there is no easy way? In the end it is open source, so there has to be a way to overwrite this and recompile. If you know how and where I should make the changes, that would be very helpful. – Or Raz Sep 11 '18 at 20:24