I am a beginner in using boto3 and I'd like to compress a file that is on a s3 bucket without downloading it to my local laptop. It is supposed to be a streaming compression (Glue aws). Here you can find my three attempts. The first one would be the best one because it is, in my opinion, on stream (similar to "gzip.open" function).
First wrong attempt (gzip.s3.open does not exists...):
with gzip.s3.open('s3://bucket/attempt.csv','wb') as fo:
"operations (write a file)"
Second wrong attempt (s3fs gzip compression on pandas dataframe):
import gzip
import boto3
from io import BytesIO, TextIOWrapper
s3 = boto3.client('s3', aws_access_key_id='', aws_secret_access_key='')
# read file
source_response_m = s3.get_object(Bucket=bucket,Key='file.csv')
df = pd.read_csv(io.BytesIO(source_response_m['Body'].read()))
# compress file
buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=buffer) as zipped_file:
df.to_csv(TextIOWrapper(zipped_file, 'utf8'), index=False)
# upload it
s3_resource = boto3.resource('s3',aws_access_key_id='', aws_secret_access_key='')
s3_object = s3_resource.Object(bucket, 'file.csv.gz')
s3_object.put(Body=buffer.getvalue())
Third wrong attempt (Upload Gzip file using Boto3 & https://gist.github.com/tobywf/079b36898d39eeb1824977c6c2f6d51e)
from io import BytesIO
import gzip
import shutil
import boto3
from tempfile import TemporaryFile
s3 = boto3.resource('s3',aws_access_key_id='', aws_secret_access_key='')
bucket = s3.Bucket('bucket')
def upload_gzipped(bucket, key, fp, compressed_fp=None, content_type='text/plain'):
"""Compress and upload the contents from fp to S3.
If compressed_fp is None, the compression is performed in memory.
"""
if not compressed_fp:
compressed_fp = BytesIO()
with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
shutil.copyfileobj(fp, gz)
compressed_fp.seek(0)
bucket.upload_fileobj(compressed_fp, key, {'ContentType': content_type, 'ContentEncoding': 'gzip'})
upload_gzipped(bucket,'folder/file.gz.csv', 'file.csv.gz')
Honestly I have no idea how to use the latter attempt. The doc I have found is not very clear and there are no examples.
Do you have any ideas/suggestions to overcome my issue?
Thanks in advance.
Solution
I was able to solve my issue using the link below. Hope it will be useful for you.
https://gist.github.com/veselosky/9427faa38cee75cd8e27
D