I have 5 TB of data that I need to transfer to a GCP bucket using a command-line tool.

I tried hadoop distcp -m num -strategy dynamic source_path destination_path, but it has been running for a very long time and has still not finished.

Is there an alternative command for copying a large amount of data from an HDFS location to a GCP bucket?

I tested the distcp command on 50 GB of data with different numbers of mappers, using:

hadoop distcp -m num -strategy dynamic source_path destination_path

I tried the following values for -m:

  • with -m 18 -> it took 16 mins
  • with -m 22 -> it took 12 mins
  • with -m 44 -> it took 18 mins
  • with -m 60 -> it took 5 mins 20 sec
  • with -m 72 -> it took 5 mins 9 sec
  • with -m 80 -> it took 5 mins 7 sec
  • with -m 84 -> it took 16 mins 10 sec
  • with -m 88 -> it took 11+ mins
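
As a rough back-of-the-envelope check, the timings above can be converted into throughput and a projected duration for the full 5 TB. This is only a sketch under stated assumptions: decimal units (5 TB = 5000 GB, 1 GB = 1000 MB) and linear scaling from the 50 GB sample, which a real run may not follow. The function names estimate_hours and throughput_mb_s are illustrative, and the -m 88 run is left out because its time was only recorded as "11+ mins".

```shell
#!/bin/sh
# Project the 50 GB test timings onto the full 5 TB transfer.
# Assumptions: decimal units and linear scaling from the sample.

SAMPLE_GB=50
TOTAL_GB=5000

# Projected hours for TOTAL_GB, given the seconds measured for SAMPLE_GB.
estimate_hours() {
    awk -v secs="$1" -v sample="$SAMPLE_GB" -v total="$TOTAL_GB" \
        'BEGIN { printf "%.1f", secs * (total / sample) / 3600 }'
}

# Average throughput in MB/s for the sample run.
throughput_mb_s() {
    awk -v secs="$1" -v sample="$SAMPLE_GB" \
        'BEGIN { printf "%.0f", sample * 1000 / secs }'
}

# mappers:seconds pairs taken from the timings listed above
for run in 18:960 22:720 44:1080 60:320 72:309 80:307 84:970; do
    mappers=${run%%:*}
    secs=${run##*:}
    echo "-m $mappers: $(throughput_mb_s "$secs") MB/s, about $(estimate_hours "$secs") h for 5 TB"
done
```

If the assumptions hold, even the fastest runs (-m 60 to -m 80) would need several hours for 5 TB, so a long-running job is not by itself a sign that something is stuck.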

Can someone please suggest an alternative to distcp?

  • Are you sure you don't have a network bottleneck? Also, what is your network bandwidth? – guillaume blaquiere Aug 03 '23 at 08:08
  • No, we don't have any network bottleneck. I am using a cluster of 1 master node and 18 worker nodes to copy data from HDFS to the GCP bucket. The master node is 500 GB and each worker node is 200 GB. – Chetan Mane Aug 03 '23 at 11:59
  • Have a look at this Stack Overflow [thread](https://stackoverflow.com/questions/48799657/hadoop-fs-du-gsutil-du-is-running-slow-on-gcp) and this [document](https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-data#improving_data_migration_speed). – Sathi Aiswarya Aug 03 '23 at 12:55
  • Network performance can be affected by two primary issues: a) instance size, since a larger instance can process data faster over the network; b) distance to the Cloud Storage bucket. Add details to your post. The number of nodes in your cluster does not matter for the most part. – John Hanley Aug 03 '23 at 19:16
  • FYI, the size is 200 GB per worker node and 500 GB for the master node, and they are in the same region and zone. Also, there are no failures and the job is not getting stuck anywhere. The copy job is just too slow. – Chetan Mane Aug 08 '23 at 06:41
  • Your comment does not provide the details I asked for (instance size and region, bucket region). Edit your post to provide details. Do not use comments. – John Hanley Aug 08 '23 at 06:43