43

I would like to transfer data from a table in BigQuery into a table in Redshift. My planned data flow is as follows:

BigQuery -> Google Cloud Storage -> Amazon S3 -> Redshift

I know about Google Cloud Storage Transfer Service, but I'm not sure it can help me. From Google Cloud documentation:

Cloud Storage Transfer Service

This page describes Cloud Storage Transfer Service, which you can use to quickly import online data into Google Cloud Storage.

I understand that this service can be used to import data into Google Cloud Storage and not to export from it.

Is there a way I can export data from Google Cloud Storage to Amazon S3?

Onca

7 Answers

50

You can use gsutil to copy data from a Google Cloud Storage bucket to an Amazon bucket, using a command such as:

gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket

Note that the -d option above will cause gsutil rsync to delete objects from your S3 bucket that aren't present in your GCS bucket (in addition to adding new objects). You can leave off that option if you just want to add new objects from your GCS to your S3 bucket.
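
Because -d can delete objects on the S3 side, it can be worth doing a dry run first. A minimal sketch, using the same placeholder bucket names as above; the -n flag makes gsutil rsync report what it would copy or delete without actually doing it:

# dry run: show what would be copied/deleted, without changing anything
gsutil -m rsync -rdn gs://your-gcs-bucket s3://your-s3-bucket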

Mike Schwartz
  • I'm getting an error for the same operation although the s3 bucket has public read and write access. Hope I'm not missing anything here. The gsutil was executed inside the Google Cloud Shell. Error Message - ERROR 1228 14:00:22.190043 utils.py] Unable to read instance data, giving up Failure: No handler was ready to authenticate. 4 handlers were checked. ['HmacAuthV1Handler', 'DevshellAuth', 'OAuth2Auth', 'OAuth2ServiceAccountAuth'] Check your credentials. – Nirojan Selvanathan Jan 02 '18 at 07:05
  • Before that you need to add your AWS credentials in the boto config file – MJK Apr 05 '18 at 10:30
  • The boto config file is used for credentials if you installed standalone gsutil, while the credential store is used if you installed gsutil as part of the Google Cloud SDK (https://cloud.google.com/storage/docs/gsutil_install#sdk-install) – Mike Schwartz Jan 31 '19 at 16:19
  • 3
    This works but unfortunately gsutil does not support multipart uploads, which the S3 API requires for files larger than 5GB. – Pathead Feb 19 '19 at 17:52
  • I'm running the above command on a Google VM instance where the download/upload speed is ~500-600 Mbps, and the data to be migrated is 400 GB. The process is taking very long. Is there any way I can speed up the migration? – raghav May 30 '19 at 10:05
  • raghav@ - you could shard the gsutil rsync, running on several VMs, with each handling a subset of the copying - see https://stackoverflow.com/questions/31492872/how-to-list-all-files-in-google-storage-bucket-in-a-short-time for an example of sharding (in that case it's listing, but hopefully that gives you an idea of how to do it for copying too). – Mike Schwartz Dec 12 '19 at 16:30
  • @MikeSchwartz Will this work if I need to transfer files to an AWS Mainland China (Beijing region) S3 bucket? I found that they allow S3 non-China to China region transfer (as in the link below), but I'm not sure about cross-cloud: https://aws.amazon.com/blogs/storage/transferring-amazon-s3-data-from-aws-regions-to-aws-regions-in-china/ – Rishabh Rusia Dec 04 '21 at 12:42
16

Go to any instance or Cloud Shell in GCP.

First of all, configure your AWS credentials there:

aws configure

If the command is not recognised, install the AWS CLI by following this guide: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html

To configure the AWS CLI, follow this URL: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

Then use gsutil:

gsutil -m rsync -rd gs://storagename s3://bucketname

16 GB of data was transferred in a few minutes.
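
For reference, a minimal end-to-end sketch of the steps above, run from Cloud Shell; the bucket names are placeholders and the AWS CLI install commands are the zip-based installer mentioned in the comments below:

# install the AWS CLI (skip if the aws command already exists)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# enter the AWS access key, secret key and default region when prompted
aws configure

# mirror the GCS bucket into the S3 bucket (-d also deletes extra objects in S3)
gsutil -m rsync -rd gs://storagename s3://bucketname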

Noordeen
  • Is it possible to install the AWS CLI in Google Cloud Shell? If so, can you tell me how? – Andrew Irwin Feb 21 '20 at 16:54
  • You can just execute: curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && unzip awscliv2.zip && sudo ./aws/install – Chris Oct 21 '22 at 02:07
  • 1
    This works ok, although I got a load of 'connection refused' and 'broken pipe' errors doing this from OS X so many retries have been necessary with rsync -i to ignore files that already copied ok (OS X single process mode is more reliable but slooow). Also I got an MD5 mismatch error that stopped many files from copying over. Error was 'md5 signature for source object doesn't match destination object digest'. This can be resolved by specifying the encryption type in the command: gsutil -h "x-amz-server-side-encryption: AES256" -m rsync -rdi gs://storagename s3://bucketname – urchino Oct 24 '22 at 16:38
  • My current estimate is that it will take approximately a week to transfer 250Gb of data using this mechanism. It produces a lot of errors. Random 'Connection reset by peer' is the latest. Note that all the errors are on the gsync to google side - you get the same behaviour if you try to take a local copy of the files without using S3 at all. The GSUtil cmd tool is very flaky and should be avoided if at all possible. – urchino Oct 24 '22 at 18:16
4

Using Rclone (https://rclone.org/).

Rclone is a command line program to sync files and directories to and from:

Google Drive
Amazon S3
Openstack Swift / Rackspace cloud files / Memset Memstore
Dropbox
Google Cloud Storage
Amazon Drive
Microsoft OneDrive
Hubic
Backblaze B2
Yandex Disk
SFTP
The local filesystem
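
A minimal sketch with rclone, assuming you have already defined a Google Cloud Storage remote and an S3 remote via rclone config (the remote names gcs and s3 below are placeholders):

# interactively define the GCS and S3 remotes (one-time setup)
rclone config

# sync the GCS bucket into the S3 bucket, showing progress
rclone sync gcs:your-gcs-bucket s3:your-s3-bucket -P
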
Itsites
4

Using the gsutil tool we can do a wide range of bucket and object management tasks, including:

  1. Creating and deleting buckets.
  2. Uploading, downloading, and deleting objects.
  3. Listing buckets and objects.
  4. Moving, copying, and renaming objects.

We can copy data from a Google Cloud Storage bucket to an Amazon S3 bucket using the gsutil rsync and gsutil cp operations.

gsutil rsync collects all the metadata from the bucket and syncs the data to S3:

gsutil -m rsync -r gs://your-gcs-bucket s3://your-s3-bucket

gsutil cp copies the files one by one; since the transfer rate is good, it copies approximately 1 GB per minute:

gsutil cp gs://<gcs-bucket> s3://<s3-bucket-name>

If you have a large number of files with a high data volume, use the bash script below and run multiple copies in the background with the screen command on an AWS or GCP instance that has AWS credentials configured and GCP auth verified (a parallelization sketch follows after the script).

Before running the script, list all the files, redirect the listing to a file, and have the script read that file as input to copy the files:

gsutil ls gs://<gcs-bucket> > file_list_part.out

Bash script:

#!/bin/bash
echo "start processing"
# file listing one gs:// object URL per line; can be overridden as the first argument
input="${1:-file_list_part.out}"
while IFS= read -r line
do
    command="gsutil cp ${line} s3://<bucket-name>"
    now=$(date +"%T")
    echo "command :: $command :: $now"
    eval "$command"
    retVal=$?
    if [ $retVal -ne 0 ]; then
        echo "Error copying file"
        exit 1
    fi
    echo "Copy completed successfully"
done < "$input"
echo "completed processing"
echo "completed processing"

Execute the bash script and write the output to a log file to check the progress of completed and failed files:

bash file_copy.sh > /root/logs/file_copy.log 2>&1
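
To run several copies in parallel as suggested above, one option is to split the listing and start one detached screen session per chunk. A sketch, assuming the script is saved as file_copy.sh and takes the list file as its first argument (as in the version above); the chunk and log file names are arbitrary:

# split the listing into 4 roughly equal chunks without breaking lines
split -n l/4 file_list_part.out file_list_chunk_
# start one detached screen session per chunk
for chunk in file_list_chunk_*; do
    screen -dmS "copy_${chunk}" bash -c "bash file_copy.sh ${chunk} > /root/logs/${chunk}.log 2>&1"
done
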
Javeed Shakeel
3

I needed to transfer 2 TB of data from a Google Cloud Storage bucket to an Amazon S3 bucket. For the task, I created a Google Compute Engine instance with 8 vCPUs (30 GB of memory).

Allow login using SSH on the Compute Engine instance. Once logged in, create an empty .boto configuration file and add the AWS credential information, taking the reference from the mentioned link.
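
A minimal sketch of the relevant .boto entries, assuming a standalone gsutil install; the key values are placeholders:

[Credentials]
aws_access_key_id = YOUR_AWS_ACCESS_KEY_ID
aws_secret_access_key = YOUR_AWS_SECRET_ACCESS_KEY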

Then run the command:

gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket

The data transfer rate is ~1 GB/s.

Hope this helps. (Do not forget to terminate the Compute Engine instance once the job is done.)

Raxit Solanki
0

For large amounts of large files (100 MB+) you might get issues with broken pipes and other annoyances, probably due to the multipart upload requirement (as Pathead mentioned).

In that case you're left with simply downloading all the files to your machine and uploading them back. Depending on your connection and data volume, it might be more effective to create a VM instance to take advantage of a high-speed connection and the ability to run the copy in the background on a machine other than yours.

Create the VM (make sure the service account has access to your buckets), connect via SSH, install the AWS CLI (apt install awscli) and configure access to S3 (aws configure).

Run these two lines, or make them a bash script if you have many buckets to copy (a loop sketch follows below).

gsutil -m cp -r "gs://$1" ./
aws s3 cp --recursive "./$1" "s3://$1"

(It's better to use rsync in general, but cp was faster for me)
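
If you do have many buckets, a minimal loop sketch along those lines; it assumes the GCS and S3 buckets share names, as in the commands above, and the bucket names listed are placeholders:

#!/bin/bash
# download each bucket from GCS, then upload it to the same-named S3 bucket
for bucket in first-bucket second-bucket; do
    gsutil -m cp -r "gs://${bucket}" ./
    aws s3 cp --recursive "./${bucket}" "s3://${bucket}"
done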

Marian Hlavac
0

Tools like gsutil and aws s3 cp won't use multipart uploads/downloads, so they will have poor performance for large files.

Skyplane is a much faster alternative for transferring data between clouds (up to 110x for large files). You can transfer data with the command:

skyplane cp -r s3://aws-bucket-name/ gcs://google-bucket-name/

(disclaimer: I am a contributor)
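
As a rough sketch of getting started (the install extras and the init command follow the Skyplane docs at the time of writing; treat the exact names as assumptions):

# install Skyplane with AWS and GCP support, then configure cloud credentials
pip install "skyplane[aws,gcp]"
skyplane init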

swooders
  • Neat! If I understand correctly, Skyplane creates a bunch of VMs on our behalf using our accounts to speed up the transfer, and the fees are on us, right? – Jerther May 16 '23 at 19:01