
I have this query: let's say I have 3 DataNode + NodeManager machines (a 3-node cluster) with a replication factor of 3. The input file on the first node is split into 4 blocks, so by default 4 mappers will run in parallel. Since the replication factor is 3, will 12 mappers be running at the beginning?

Ramineni Ravi Teja

1 Answer


The number of blocks depends on the file size. A 1 GB file makes 8 blocks (of 128 MB each).

All 8 blocks will be replicated three times, following data locality and rack awareness, but that doesn't mean all 24 (8 x 3) block replicas are processed when you run a job against this file. Replication exists so the cluster can recover from scenarios like disk failures.
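The arithmetic above can be sketched in a few lines. This is a minimal illustration (function names are my own, not a Hadoop API), assuming the default 128 MB block size:

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=128):
    """Number of HDFS blocks for a file (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)

file_size_mb = 1024                              # a 1 GB file
replication_factor = 3

blocks = hdfs_block_count(file_size_mb)          # 8 blocks
replicas_stored = blocks * replication_factor    # 24 copies kept on disk
mappers_launched = blocks                        # still 8 - one per split

print(blocks, replicas_stored, mappers_launched)  # 8 24 8
```

The key point: replication multiplies what is *stored*, not what is *processed*.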

So, to answer your question:

Number of mappers = number of input splits (in most cases, the number of blocks).

Only 8 mappers will run on the cluster. Hadoop decides which nodes to run them on based on data locality, preferring the node in the cluster that holds the closest replica of each block.
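For reference, the split size Hadoop uses comes from the formula in `FileInputFormat.computeSplitSize`: `max(minSize, min(maxSize, blockSize))`. A sketch of that calculation (helper names are my own; defaults assume no min/max split size is configured):

```python
def compute_split_size(block_size, min_size=1, max_size=float("inf")):
    # Mirrors the formula used by Hadoop's FileInputFormat:
    # max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def num_splits(file_size, block_size=128 * 1024 * 1024):
    split = compute_split_size(block_size)
    return -(-file_size // split)   # ceiling division

# 1 GB file with 128 MB blocks -> 8 splits -> 8 mappers
print(num_splits(1024 * 1024 * 1024))  # 8
```

With default settings the split size equals the block size, which is why "number of mappers = number of blocks" holds in most cases; tuning the min/max split size changes that.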

The behaviour is different if speculative execution is enabled for the cluster: hadoop-speculative-task-execution

Ronak Patel