I am using `gcloud storage cp` to transfer a large amount of data from a source bucket to a destination bucket, with the `--no-clobber` option to skip files that have already been copied:
gcloud storage cp -r --no-clobber "gs://test-1/*" "gs://test-2" --encryption-key=XXXXXXXXXXXXXXXX --storage-class=REGIONAL
One of the challenges is that I am moving terabytes of data (all files are kilobytes in size) from one bucket to another, and the source bucket is encrypted with CSEK (customer-supplied encryption keys). GCP's Storage Transfer Service doesn't work for buckets encrypted with CSEK.
Since I know this will take a long time, I will run the process on long-running VMs. In the case of intermittent network or zonal failures, we might have to restart the `gcloud storage cp` command.
For example, copying from gs://test-1 to gs://test-2 took ~7.35 hours (837,136 files, 3.5 GiB total) from my local machine (Apple MacBook Pro M1 with 32 GB RAM). The time taken was relatively high, which may be due to the overhead of encryption and decryption in the cloud.
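To put those numbers in perspective, a quick back-of-the-envelope calculation (using only the figures from my run above) shows that per-object overhead, not raw bandwidth, dominates when the files are this small:

```python
# Rough throughput from the run above: 837,136 files / 3.5 GiB in ~7.35 hours.
# With KB-sized files, per-object request overhead dominates over bandwidth.
files = 837_136
hours = 7.35
total_gib = 3.5

files_per_sec = files / (hours * 3600)
mib_per_sec = total_gib * 1024 / (hours * 3600)
print(f"{files_per_sec:.1f} files/s, {mib_per_sec:.3f} MiB/s")
# → 31.6 files/s, 0.135 MiB/s
```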
With `--no-clobber`, it still makes a call per object to check whether that object already exists in the destination bucket. This is a Class B operation, so rechecking millions of objects on every retry incurs cost just for the existence checks.
Class B operations:

- storage.*.get
- storage.*.getIamPolicy
- storage.*.testIamPermissions
- storage.*AccessControls.list
- storage.notifications.list
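A rough estimate of what those existence checks cost on a restart, assuming the commonly quoted Standard-storage Class B rate of $0.004 per 10,000 operations (an assumption on my part; check current GCP pricing for your storage class):

```python
# Rough cost of --no-clobber existence checks for one full restart,
# assuming $0.004 per 10,000 Class B operations (Standard storage; this
# rate is an assumption and varies by storage class).
num_objects = 837_136                 # object count from my test run
usd_per_class_b_op = 0.004 / 10_000   # assumed rate

cost_per_restart = num_objects * usd_per_class_b_op
print(f"~${cost_per_restart:.2f} per full restart")  # → ~$0.33 per full restart
```

Small per restart, but it scales linearly with object count and number of retries, and the checks also cost wall-clock time.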
I checked the manifest-file mechanism, but it didn't work in my case for buckets with CSEK. It would be fantastic if the manifest file could skip already-copied files directly.
Is there a way to store an offset and continue from that offset next time, instead of first checking whether each object exists?
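To illustrate the kind of offset-based resume I have in mind: GCS object listings are returned in lexicographic order of object name, so recording the last successfully copied name would let a restart skip straight past everything before it, with no per-object existence checks. This is a minimal local sketch; `copy_object`, the checkpoint file, and the name list are hypothetical stand-ins, and a real implementation could feed the checkpoint into the JSON API's `startOffset` list parameter:

```python
# Sketch of an offset-based resume (an idea, not an existing gcloud
# feature). Object names are processed in sorted order, so the last
# copied name can serve as the resume offset.
from pathlib import Path

CHECKPOINT = Path("checkpoint.txt")  # hypothetical local checkpoint file

def load_checkpoint() -> str:
    return CHECKPOINT.read_text().strip() if CHECKPOINT.exists() else ""

def save_checkpoint(name: str) -> None:
    CHECKPOINT.write_text(name)

def resume_copy(names, copy_object):
    """Copy objects in name order, skipping everything at or before the checkpoint."""
    last = load_checkpoint()
    for name in sorted(names):
        if last and name <= last:   # already handled in a previous run
            continue
        copy_object(name)           # stand-in for one real object copy
        save_checkpoint(name)

# Simulated run: pretend obj-001 and obj-002 were copied before a crash.
save_checkpoint("obj-002")
copied = []
resume_copy(["obj-001", "obj-002", "obj-003", "obj-004"], copied.append)
print(copied)  # → ['obj-003', 'obj-004']
```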