
I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble finding out how to do it in Python.

I have already written the code in Kotlin (because it was easiest for me, but for reasons outside the scope of my question we want it to run in Python). In Kotlin, the Google SDK allows me to get an InputStream from the Blob object, which I can then pass into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).

With the Python SDK it seems I only have the options to download the blob to a local file or as a string.
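
For reference, the two download options I'm referring to in the google-cloud-storage client look roughly like this (the project, bucket, and object names are just placeholders):

from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.get_bucket("my-gcs-bucket")    # placeholder bucket name
blob = bucket.blob("exports/my-file.json")     # placeholder object name

# Option 1: write the object to a local file on disk
blob.download_to_filename("/tmp/my-file.json")

# Option 2: load the whole object into memory as a byte string
content = blob.download_as_string()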

I would like (as I do in Kotlin) to pass some object returned from the Blob into the AmazonS3.putObject() equivalent, without having to save the content as a file first.

I am in no way a Python pro, so I might have missed an obvious way of doing this.

Martin Hansen
    Is this something you need to do on a recurring basis, or just once? If the latter (or if you don't mind scripting), you could use gsutil to do it: gsutil -m cp -r gs://your-gcs-bucket s3://your-s3-bucket – Mike Schwartz Oct 23 '17 at 15:17
  • I need to do this on a recurring basis, hence why i would like it in code rather than scripting it in bash with gsutil. (we use data pipeline and/or airflow to do these kinds of things, and the gsutil/google sdk is a pain do setup from scratch. – Martin Hansen Oct 24 '17 at 07:14
  • If you are using Airflow, why not use the bash operator and the gsutil command? An example of something similar is here https://stackoverflow.com/a/53248802/435089 – Kannappan Sirchabesan Jan 18 '19 at 16:38
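
For anyone who goes the route suggested in the last comment, a minimal Airflow sketch might look like the following; the DAG id, schedule, and bucket names are placeholders, the import path assumes Airflow 1.x, and gsutil has to be installed on the worker and configured with both GCS and S3 credentials:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in Airflow 2.x

with DAG(
    dag_id="gcs_to_s3_copy",          # placeholder DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",       # placeholder schedule
    catchup=False,
) as dag:
    # Runs the same gsutil command suggested in the comment above on the Airflow worker.
    copy_gcs_to_s3 = BashOperator(
        task_id="copy_gcs_to_s3",
        bash_command="gsutil -m cp -r gs://your-gcs-bucket s3://your-s3-bucket",
    )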

2 Answers


I ended up with the following solution, since download_to_file can write the data into a file-like object (here a BytesIO buffer) that the boto3 S3 client can handle.

This works just fine for smaller files, but since it buffers the whole object in memory, it could be problematic for larger files.

from io import BytesIO

import boto3
from google.cloud import storage


def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")

    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)

    # Download the whole blob into an in-memory buffer, then rewind it.
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)

    # upload_fileobj accepts any file-like object opened in binary mode.
    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
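
Called like this, for example (the bucket and object names here are just placeholders):

copy_data_from_gcs_to_s3(
    gcs_bucket="my-gcs-bucket",
    gcs_filename="exports/data.json",
    s3_bucket="my-s3-bucket",
    s3_filename="exports/data.json",
)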

If anyone knows of something other than BytesIO to handle the data (e.g. so I can stream the data directly into S3, without having to buffer it in memory on the host machine), it would be very much appreciated.

Martin Hansen

google-resumable-media can be used to download the file in chunks from GCS, and smart_open can be used to upload them to S3. This way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
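
A rough sketch of that approach might look like the following; the function name, chunk size, and OAuth scope are my own choices, the media URL follows google-resumable-media's documented download pattern, and newer smart_open versions expose open (older versions use smart_open.smart_open):

import urllib.parse

import google.auth
import google.auth.transport.requests as tr_requests
from google.resumable_media.requests import ChunkedDownload
from smart_open import open as s3_open


def stream_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename,
                     chunk_size=32 * 1024 * 1024):
    # Authorized transport for the GCS download (read-only scope is enough).
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/devstorage.read_only"])
    transport = tr_requests.AuthorizedSession(credentials)

    # Direct media-download URL for the object; the object name must be URL-encoded.
    media_url = (
        "https://www.googleapis.com/download/storage/v1/b/{}/o/{}?alt=media"
        .format(gcs_bucket, urllib.parse.quote(gcs_filename, safe="")))

    # smart_open provides a writable file-like object backed by an S3 multipart upload,
    # and ChunkedDownload writes each downloaded chunk straight into it, so only one
    # chunk is held in memory at a time.
    with s3_open("s3://{}/{}".format(s3_bucket, s3_filename), "wb") as s3_out:
        download = ChunkedDownload(media_url, chunk_size, s3_out)
        while not download.finished:
            download.consume_next_chunk(transport)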