Here is the answer to your query.
I know the default block size is 64M,
In Hadoop 1.x the default block size is 64 MB, and in 2.x it is 128 MB. The default block size can be overridden by setting a value for the parameter dfs.block.size (dfs.blocksize in newer releases) in the configuration file hdfs-site.xml.
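For example, a 128 MB block size could be set in hdfs-site.xml like this (a sketch; the value is in bytes, and the classic property name dfs.block.size is assumed here, with newer releases preferring dfs.blocksize):

```xml
<!-- hdfs-site.xml: override the default HDFS block size -->
<property>
  <name>dfs.block.size</name>
  <!-- 128 MB, expressed in bytes -->
  <value>134217728</value>
</property>
```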
split is 64M,
Not necessarily, as the block size is not the same as the split size. Read this post for more clarity. For a typical wordcount
example program, we can safely assume that the split size is approximately the same as the block size.
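To see why the two usually coincide, here is a minimal sketch of the formula Hadoop's FileInputFormat uses to derive the split size from the block size and the configured minimum/maximum split sizes (this mirrors the computeSplitSize logic; it does not call the real Hadoop API):

```java
// Sketch of FileInputFormat's split-size computation: the split size is
// the block size, clamped between the configured minimum and maximum
// split sizes. With the default min (1) and max (Long.MAX_VALUE), the
// split size simply equals the block size.
public class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // 64 MB block
        long minSize = 1L;                   // default minimum split size
        long maxSize = Long.MAX_VALUE;       // default maximum split size

        // With the defaults, split size == block size (67108864 bytes).
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));
    }
}
```

Raising the minimum split size above the block size is the usual way to force splits larger than one block.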
Then for files smaller than 64 MB, when the number of nodes increases from 1 to 6, there will be only one node working on the split, so the speed will not improve. Is that right?
Yes, you are right. If the file size is actually less than the block size, it will be processed by a single node, and increasing the number of nodes from 1 to 6 may not affect the execution speed. However, you must consider the case of speculative execution: with speculative execution, even a smaller file may be processed by two nodes simultaneously, which can improve the speed of execution.
From the Yahoo Dev KB link, speculative execution is explained as follows:
Speculative execution:
One problem with the Hadoop system is that by
dividing the tasks across many nodes, it is possible for a few slow
nodes to rate-limit the rest of the program. For example if one node
has a slow disk controller, then it may be reading its input at only
10% the speed of all the other nodes. So when 99 map tasks are already
complete, the system is still waiting for the final map task to check
in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual
tasks do not know where their inputs come from. Tasks trust the Hadoop
platform to just deliver the appropriate input. Therefore, the same
input can be processed multiple times in parallel, to exploit
differences in machine capabilities. As most of the tasks in a job are
coming to a close, the Hadoop platform will schedule redundant copies
of the remaining tasks across several nodes which do not have other
work to perform. This process is known as speculative execution. When
tasks complete, they announce this fact to the JobTracker. Whichever
copy of a task finishes first becomes the definitive copy. If other
copies were executing speculatively, Hadoop tells the TaskTrackers to
abandon the tasks and discard their outputs. The Reducers then receive
their inputs from whichever Mapper completed successfully, first.
Speculative execution is enabled by default. You can disable
speculative execution for the mappers and reducers by setting the
mapred.map.tasks.speculative.execution and
mapred.reduce.tasks.speculative.execution
JobConf options to false, respectively, when using the old API; with the newer API you may consider changing mapreduce.map.speculative
and mapreduce.reduce.speculative instead.
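A sketch of disabling both in mapred-site.xml, assuming the newer (Hadoop 2.x) property names:

```xml
<!-- mapred-site.xml: turn off speculative execution for maps and reduces -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>
```

The same flags can also be set per job on the Job/JobConf object rather than cluster-wide.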