
I am a beginner with boto3 and I'd like to compress a file that sits in an S3 bucket without downloading it to my local laptop. The compression is supposed to happen as a stream (the script will run on AWS Glue). Below are my three attempts. The first one would, in my opinion, be the best, because it operates on a stream (similar to the gzip.open function).

First wrong attempt (gzip.s3.open does not exist...):

with gzip.s3.open('s3://bucket/attempt.csv', 'wb') as fo:
    ...  # operations (write a file)

Second wrong attempt (gzip compression of a pandas DataFrame read from S3):

import gzip
from io import BytesIO, TextIOWrapper

import boto3
import pandas as pd

s3 = boto3.client('s3', aws_access_key_id='', aws_secret_access_key='')
bucket = 'bucket'

# read the source file from S3 into a DataFrame (the whole object is read into memory)
source_response_m = s3.get_object(Bucket=bucket, Key='file.csv')
df = pd.read_csv(BytesIO(source_response_m['Body'].read()))

# compress file
buffer = BytesIO()

with gzip.GzipFile(mode='w', fileobj=buffer) as zipped_file:
    df.to_csv(TextIOWrapper(zipped_file, 'utf8'), index=False)

# upload it
s3_resource = boto3.resource('s3', aws_access_key_id='', aws_secret_access_key='')
s3_object = s3_resource.Object(bucket, 'file.csv.gz')
s3_object.put(Body=buffer.getvalue())
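
(Another aside: if s3fs is available next to pandas, to_csv can apparently write a gzipped file straight to an S3 path in a single call, which would avoid the manual buffer handling above. A minimal sketch, assuming a reasonably recent pandas with s3fs installed; the bucket/key are placeholders:)

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# pandas delegates the S3 write to s3fs and applies gzip compression to the output
df.to_csv('s3://bucket/file.csv.gz', index=False, compression='gzip')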

Third wrong attempt (upload a gzipped file using Boto3, adapted from https://gist.github.com/tobywf/079b36898d39eeb1824977c6c2f6d51e):

from io import BytesIO
import gzip
import shutil
import boto3
from tempfile import TemporaryFile


s3 = boto3.resource('s3', aws_access_key_id='', aws_secret_access_key='')
bucket = s3.Bucket('bucket')


def upload_gzipped(bucket, key, fp, compressed_fp=None, content_type='text/plain'):
    """Compress and upload the contents from fp to S3.
    If compressed_fp is None, the compression is performed in memory.
    """
    if not compressed_fp:
        compressed_fp = BytesIO()
    with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
        shutil.copyfileobj(fp, gz)
    compressed_fp.seek(0)
    bucket.upload_fileobj(compressed_fp, key, {'ContentType': content_type, 'ContentEncoding': 'gzip'})


# NOTE: this call is wrong as written: upload_gzipped() expects a readable
# file-like object for fp, not a filename string
upload_gzipped(bucket, 'folder/file.gz.csv', 'file.csv.gz')

Honestly, I have no idea how to use that last attempt. The documentation I have found is not very clear and there are no examples.
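
(For what it's worth, since upload_gzipped() expects a readable file-like object for fp rather than a filename, one way to call it without touching local disk might be to hand it the source object's streaming body. A hedged sketch, reusing the bucket and the helper defined above:)

# the Body returned by get() is a botocore StreamingBody, i.e. a readable
# file-like object, so shutil.copyfileobj() inside upload_gzipped() can
# consume it chunk by chunk without a local file
source_body = bucket.Object('file.csv').get()['Body']
upload_gzipped(bucket, 'folder/file.csv.gz', source_body, content_type='text/csv')

Note that the compressed output still accumulates in the BytesIO inside the helper, so the read is streamed but the write is not.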

Do you have any ideas or suggestions for how to overcome this issue?

Thanks in advance.

Solution

I was able to solve my issue using the link below. I hope it will be useful for you too.

https://gist.github.com/veselosky/9427faa38cee75cd8e27
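
(For future readers, here is a hedged sketch of a fully streaming variant that goes a little beyond that link: it reads the source object in chunks, compresses them incrementally, and pushes the compressed bytes back to S3 as a multipart upload, so the whole file is never held in memory or written to disk. All bucket/key names are placeholders.)

import gzip
from io import BytesIO

import boto3

s3 = boto3.client('s3')
src_bucket, src_key = 'bucket', 'file.csv'
dst_bucket, dst_key = 'bucket', 'file.csv.gz'

PART_SIZE = 5 * 1024 * 1024   # S3 multipart parts must be >= 5 MB (except the last one)
READ_SIZE = 1024 * 1024       # how much of the source object to pull per iteration

source = s3.get_object(Bucket=src_bucket, Key=src_key)['Body']
mpu = s3.create_multipart_upload(Bucket=dst_bucket, Key=dst_key)

buffer = BytesIO()            # accumulates compressed bytes until a part is big enough
parts = []

def flush_part(data):
    """Upload one part of compressed data and remember its ETag for the final call."""
    part_number = len(parts) + 1
    response = s3.upload_part(Bucket=dst_bucket, Key=dst_key, UploadId=mpu['UploadId'],
                              PartNumber=part_number, Body=data)
    parts.append({'ETag': response['ETag'], 'PartNumber': part_number})

with gzip.GzipFile(fileobj=buffer, mode='wb') as gz:
    for chunk in iter(lambda: source.read(READ_SIZE), b''):
        gz.write(chunk)
        if buffer.tell() >= PART_SIZE:
            flush_part(buffer.getvalue())
            buffer.seek(0)
            buffer.truncate()

# closing the GzipFile flushed the gzip trailer into the buffer; upload whatever is left
flush_part(buffer.getvalue())

s3.complete_multipart_upload(Bucket=dst_bucket, Key=dst_key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})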

Comments:

  • So, all these attempts seem to have the file on your system at some point. I'm not sure I have a full answer, but there are three strategies that come to mind: 1) accept you have to download the file, then zip it, then upload the zipped file; 2) use an AWS Lambda function to do the same with a machine in the cloud instead of downloading it to your machine; or 3) (not sure about this) download the file chunk-by-chunk, compressing chunks and streaming the compressed parts back to the destination file as you go. – thclark Nov 06 '19 at 22:40
  • Also... seen this? It might be exactly what you want: https://medium.com/@johnpaulhayes/how-extract-a-huge-zip-file-in-an-amazon-s3-bucket-by-using-aws-lambda-and-python-e32c6cf58f06 – thclark Nov 06 '19 at 22:41
  • Also, this Q slightly duplicates: https://stackoverflow.com/questions/43275575/how-to-zip-files-in-amazon-s3-bucket-and-get-its-url – thclark Nov 06 '19 at 22:46
  • @fdrigo _Why_ do you not wish to download it? – John Rotenstein Nov 06 '19 at 23:12
  • @JohnRotenstein: Hi, thanks for your interest. The reason is that I would like to use this streaming compression in a Python script that is supposed to run on the AWS Glue service, which is serverless. Therefore I can't download the file locally, compress it and upload it afterwards. Does that make sense to you? @thclark: I believe that the first two attempts should not download anything to my local machine. Maybe the third one does; I haven't understood it, though. Secondly, I would rather not use a Lambda function; I'd prefer to use the Glue service. Ideas? – fdrigo Nov 07 '19 at 09:17
  • @fdrigo Are you referring to using _AWS Glue Python Shell_ to run your code? Are you sure that no disk storage is available for scripts running in this shell? Would you run this code at the start of a job or at the end of a job? – John Rotenstein Nov 07 '19 at 11:41
  • @JohnRotenstein: Yes, I am referring to AWS Glue Python Shell to run my code. Once everything works locally, I'd like to have Glue run my script. In response to your second question, I am quite sure there is no local storage associated with the Glue service. The storage backing it can be an S3 bucket, which is mounted as soon as the first Python script in Glue is created. The gzip compression should be the last step of a data-cleaning process in Python. In short, after that operation I'd like to compress my final file and upload it either to a specific bucket or to an SFTP server. Am I clear? Thanks – fdrigo Nov 07 '19 at 13:37
  • I had an issue with this answer. The problem was that the resulting file would identify as a .gz with the expected file size (when on S3). However, when I downloaded and looked at the file, it was actually the original file size and in plain text. The solution I found was to remove the third argument of upload_fileobj, which defines the ContentType and ContentEncoding. Then it all worked perfectly. I've no idea why. – Graeme Tate Nov 08 '22 at 14:33
