I am trying to move data between two hadoop clusters using distcp
. There is a lot of data to move with a large number of small files. In order to make it faster, I tried using -strategy dynamic
, which according to the documentation, 'allows faster data-nodes to copy more bytes than slower nodes'.
I am setting the number of mappers to 400. when I launch the job, I see this error: java.io.IOException: Too many chunks created with splitRatio:2, numMaps:400. Reduce numMaps or decrease split-ratio to proceed.
when I googled about it, I found this link: https://issues.apache.org/jira/browse/MAPREDUCE-5402
In this link the author asks for a feature where we can increase distcp.dynamic.max.chunks.tolerable
to resolve the issue.
The ticket says issue was resolved in the version 2.5.0
. The hadoop version I am using is 2.7.3
. So I believe it should be possible for me to increase the value of distcp.dynamic.max.chunks.tolerable
.
However, I am not sure how can I increase that. Can this configuration be updated for a single distcp job by passing it like -Dmapreduce.job.queuename
or do I have to update it on mapred-site.xml
? Any help would be appreciated.
Also does this approach work well if there are a large number of small files? are there any other parameters I can use to make it faster? Any help would be appreciated.
Thank you.