I have run two distcp jobs, one copying 400+ GB of data and another copying 35.6 GB, but both took nearly 2-3 hours to complete.
We do have enough resources in the cluster.
But when I examined the container logs, I found that copying small files takes most of the time. The file in the log excerpt below is one such small file.
2019-10-23 14:49:09,546 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying hdfs://service-namemode-prod-ab/abc/xyz/ava/abc/hello/GRP_part-00001-.snappy.parquet to s3a://bucket-name/Data/abc/xyz/ava/abc/hello/GRP_part-00001-.snappy.parquet
2019-10-23 14:49:09,940 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://bucket-name/Data/.distcp.tmp.attempt_1571566366604_9887_m_000010_0
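To confirm how much of the source tree is made up of small files like this one, this is how I would check the file counts and sizes (using the same shortened paths as above; I have not included the output here):

# directories, files, and total bytes under the source path
hdfs dfs -count -h /abc/xyz/ava/
# sizes of the individual part files in one of the leaf directories
hdfs dfs -ls -h /abc/xyz/ava/abc/hello/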
So what can be done with distcp to make this copy quicker?
Note: a copy of data on the same cluster to an internal Object Store (not AWS S3, but S3-compatible) took 4 minutes for 98.6 GB.
Command:
hadoop distcp \
  -Dmapreduce.task.timeout=0 \
  -Dfs.s3a.fast.upload=true \
  -Dfs.s3a.fast.buffer.size=157286400 \
  -Dfs.s3a.multipart.size=314572800 \
  -Dfs.s3a.multipart.threshold=1073741824 \
  -Dmapreduce.map.memory.mb=8192 \
  -Dmapreduce.map.java.opts=-Xmx7290m \
  -Dfs.s3a.max.total.tasks=1 \
  -Dfs.s3a.threads.max=10 \
  -bandwidth 1024 \
  /abc/xyz/ava/ s3a://bucket-name/Data/
Which of these values can be tuned to improve the throughput?
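For example, would raising the map count and the S3A connection/thread pools be the right direction? A sketch of what I had in mind is below; the specific numbers for -m, fs.s3a.connection.maximum, fs.s3a.threads.max and fs.s3a.max.total.tasks are guesses I have not benchmarked:

# -m: more map tasks than the distcp default of 20, so more files are copied in parallel
# -numListstatusThreads / -strategy dynamic: faster source listing and better load balancing when there are many small files
# fs.s3a.connection.maximum / fs.s3a.threads.max / fs.s3a.max.total.tasks: larger S3A connection and upload pools per task
hadoop distcp \
  -Dmapreduce.task.timeout=0 \
  -Dfs.s3a.fast.upload=true \
  -Dfs.s3a.connection.maximum=200 \
  -Dfs.s3a.threads.max=64 \
  -Dfs.s3a.max.total.tasks=100 \
  -m 100 \
  -numListstatusThreads 40 \
  -strategy dynamic \
  /abc/xyz/ava/ s3a://bucket-name/Data/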
My cluster specs are as follows:
Allocated memory (cumulative) - 1.2T
Available memory - 5.9T
Allocated VCores (cumulative) - 119T
Available VCores - 521T
Configured Capacity - 997T
HDFS Used - 813T
Non-HDFS Used - 2.7T
Can anyone suggest a way to overcome this issue, and an optimal distcp configuration for transferring 800 GB - 1 TB of files, usually from HDFS to the Object Store?