
I am trying to move data between two Hadoop clusters using distcp. There is a lot of data to move, with a large number of small files. To make it faster, I tried using -strategy dynamic, which according to the documentation "allows faster data-nodes to copy more bytes than slower nodes".

I am setting the number of mappers to 400. When I launch the job, I see this error: java.io.IOException: Too many chunks created with splitRatio:2, numMaps:400. Reduce numMaps or decrease split-ratio to proceed.

When I googled the error, I found this link: https://issues.apache.org/jira/browse/MAPREDUCE-5402. In that ticket the author asks for the ability to increase distcp.dynamic.max.chunks.tolerable to resolve the issue.

The ticket says the issue was resolved in version 2.5.0. The Hadoop version I am using is 2.7.3, so I believe it should be possible for me to increase the value of distcp.dynamic.max.chunks.tolerable.

However, I am not sure how I can increase it. Can this configuration be set for a single distcp job by passing it on the command line, like -Dmapreduce.job.queuename, or do I have to update it in mapred-site.xml? Any help would be appreciated.

Also, does this approach work well when there are a large number of small files? Are there any other parameters I can use to make the copy faster?

Thank you.

Hemanth

1 Answer


I was able to figure it out. The parameters can be passed on the distcp command line itself, instead of having to update mapred-site.xml:

hadoop distcp -Ddistcp.dynamic.recordsPerChunk=50 -Ddistcp.dynamic.max.chunks.tolerable=10000 -skipcrccheck -m 400 -prbugc -update -strategy dynamic "hdfs://source" "hdfs://target"
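For context, the error arises because the dynamic strategy splits the copy listing into roughly splitRatio × numMaps chunks and refuses to run when that product exceeds distcp.dynamic.max.chunks.tolerable. Below is a minimal sketch of that check in Python; the function name is illustrative (not Hadoop source), and the default limit of 400 is an assumption inferred from the error message, so verify it against your Hadoop version:

```python
# Sketch of the chunk-count check performed by DistCp's dynamic strategy.
# chunks_ok and the default limit of 400 are illustrative assumptions,
# not the actual Hadoop implementation.
def chunks_ok(num_maps, split_ratio=2, max_chunks_tolerable=400):
    """Return True if the expected chunk count fits under the limit."""
    expected_chunks = num_maps * split_ratio
    return expected_chunks <= max_chunks_tolerable

# With the assumed defaults, 400 maps yield ~800 chunks and fail the check,
# which matches the IOException above.
print(chunks_ok(400))
# Raising the limit, as the command above does, lets the job proceed.
print(chunks_ok(400, max_chunks_tolerable=10000))
```

This also explains why either lowering -m or raising distcp.dynamic.max.chunks.tolerable resolves the error: both sides of the comparison can be moved.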
Hemanth