
I know the default block size is 64 MB and the split size is 64 MB. For a file smaller than 64 MB, if the number of nodes increases from 1 to 6, there will still be only one node working on the single split, so the speed will not improve. Is that right? If it is a 128 MB file, 2 nodes will work on the 2 splits, so it is faster than with 1 node, but with 3 or more nodes the speed doesn't increase further. Is that right?

I don't know if my understanding is correct. Thanks for any comments!

Menkot

2 Answers


You're assuming a large file is splittable to begin with, which isn't always the case.

If your files are always smaller than a block, adding more nodes will never speed up processing; it will only help with replication and total cluster capacity.

Otherwise, your understanding seems correct, though I think the latest default is actually 128 MB, not 64 MB.

OneCricketeer

Here are the answers to your questions, point by point.

I know the default block size is 64 MB,

In Hadoop version 1.x the default block size is 64 MB, and in version 2.x the default is 128 MB. The default block size can be overridden by setting a value for the parameter dfs.block.size (dfs.blocksize in Hadoop 2.x) in the configuration file hdfs-site.xml.
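Besides editing hdfs-site.xml cluster-wide, the block size can also be overridden on the client side through the Java Configuration API. Here is a minimal sketch, assuming the Hadoop 2.x Java API; the 128 MB value and the class name BlockSizeDemo are only illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();

        // Override the block size used for files written by this client.
        // "dfs.blocksize" is the Hadoop 2.x name; "dfs.block.size" is the older, deprecated name.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB (illustrative value)

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Block size in effect: "
                + fs.getDefaultBlockSize(new Path("/")) + " bytes");
    }
}
```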

the split size is 64 MB,

Not necessarily, as the block size is not the same as the split size. Read this post for more clarity. For a normal word count example program, we can safely assume that the split size is approximately the same as the block size.
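With the newer mapreduce API, FileInputFormat derives the split size from the block size together with a configurable minimum and maximum, roughly max(minSize, min(maxSize, blockSize)), which is why the split only equals the block when those limits are left at their defaults. A small sketch of that relationship, assuming the Hadoop 2.x API; the 32 MB cap and the class name are only illustrative:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    // Same rule FileInputFormat applies: the split size equals the block
    // size only when the min/max split sizes are left at their defaults.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();

        // Illustrative override: cap splits at 32 MB even on 128 MB blocks.
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);

        long blockSize = 128L * 1024 * 1024;
        // Prints 33554432 (32 MB): the split is smaller than the block.
        System.out.println(computeSplitSize(blockSize, 1L, 32L * 1024 * 1024));
    }
}
```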

then for files smaller than 64 MB, when the number of nodes increases from 1 to 6, there will still be only one node working on the single split, so the speed will not improve. Is that right?

Yes, you are right. If the file size is actually less than the block size, then it will be processed by one node, and increasing the number of nodes from 1 to 6 may not affect the execution speed. However, you must consider the case of speculative execution: with speculative execution enabled, even a small file may be processed by 2 nodes simultaneously, which can improve the execution speed.
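Ignoring speculative execution for a moment, here is a back-of-the-envelope sketch of the split arithmetic behind this answer, assuming a 64 MB split size as in the question (the real FileInputFormat logic also allows a small slop factor, so treat this as an approximation):

```java
public class SplitCountDemo {
    public static void main(String[] args) {
        long splitSize = 64L * 1024 * 1024; // 64 MB, as assumed in the question

        long smallFile = 10L * 1024 * 1024;     // 10 MB file (smaller than one split)
        long twoBlockFile = 128L * 1024 * 1024; // 128 MB file (two splits)

        // Number of splits ~= ceil(fileSize / splitSize) = upper bound on parallel map tasks.
        System.out.println((smallFile + splitSize - 1) / splitSize);    // 1 -> only 1 node maps it
        System.out.println((twoBlockFile + splitSize - 1) / splitSize); // 2 -> at most 2 nodes map it
    }
}
```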

From the Yahoo Dev KB link, speculative execution is explained as follows:

Speculative execution:

One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.

By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.

Speculative execution is enabled by default. With the old API you can disable it for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively; with the newer API, the corresponding properties are mapreduce.map.speculative and mapreduce.reduce.speculative.
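A minimal sketch of turning it off from a driver written against the newer API, using the property names quoted above (the class name and job name are only illustrative; the Job class also exposes convenience setters for the same settings):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DisableSpeculation {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Newer API property names, as mentioned above.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "speculation-disabled-job");

        // Equivalent convenience setters on the Job object.
        job.setMapSpeculativeExecution(false);
        job.setReduceSpeculativeExecution(false);
    }
}
```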

Gyanendra Dwivedi