
Say I have 200 input files and 20 nodes, and each node has 10 mapper slots. Will Hadoop always allocate the work evenly, such that each node will get 10 input files and simultaneously start 10 mappers? Is there a way to force this behavior?

sangfroid

1 Answer


The number of mappers is determined by the input, specifically by the input splits. So in your case, the 200 files could be fed to 200 mappers. But the real answer is a little more complicated. It depends on:

  • file size: if a file is bigger than the block size, it is split, and each block-sized chunk is sent to its own mapper

  • whether the files are splittable: gzip-compressed files, for example, cannot be split, so the entire file goes to a single mapper (even if the file is bigger than a block)
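The split math above can be sketched roughly as follows. This is a simplified illustration, not Hadoop's actual `FileInputFormat` code; the 128 MB block size and the file sizes in `main` are assumptions made for the example:

```java
// Simplified sketch of how Hadoop derives the number of map tasks from
// input files. NOT the real FileInputFormat implementation; block size
// and file sizes are made up for illustration.
public class SplitCountSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // assumed 128 MB block

    // Returns how many map tasks one input file would produce.
    static long splitsForFile(long fileSizeBytes, boolean splittable) {
        if (!splittable) {
            return 1; // e.g. a gzip file always goes to a single mapper
        }
        // One split per full block, plus one for any remainder.
        return Math.max(1, (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE);
    }

    public static void main(String[] args) {
        // A small 1 MB file -> 1 mapper
        System.out.println(splitsForFile(1024L * 1024, true));
        // A 300 MB splittable file -> 3 mappers (128 + 128 + 44 MB)
        System.out.println(splitsForFile(300L * 1024 * 1024, true));
        // A 300 MB gzip file -> still only 1 mapper
        System.out.println(splitsForFile(300L * 1024 * 1024, false));
    }
}
```

So with 200 files that are each smaller than a block, you would get 200 map tasks, one per file, regardless of how many slots each node has.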

Sujee Maniyam
  • Let's assume the files are really small, each less than a block. Here's another question: if I have 20 nodes and each node has 10 mapper slots, what happens when I only have 20 input files? Will they spread evenly throughout the cluster, or will 2 nodes each get 10 files? – sangfroid Mar 19 '13 at 20:47
  • Hadoop will try to schedule tasks on nodes that hold the files, so the tasks read data locally rather than streaming it across the network. So, and this is a guess, only a few nodes might get to run the mappers. Good question though! (If you can do a few runs and post your findings, that would be fantastic.) – Sujee Maniyam Mar 22 '13 at 20:13