8

I currently have a script that reads the existing version of a CSV saved to S3, combines it with the new rows in a pandas DataFrame, and then writes the result directly back to S3.

    import boto3
    s3_resource = boto3.resource('s3')

    try:
        # pull down the existing CSV (if there is one) as text
        csv_prev_content = str(s3_resource.Object('bucket-name', ticker_csv_file_name).get()['Body'].read(), 'utf8')
    except Exception:
        # no previous version of the file exists yet
        csv_prev_content = ''

    # append the new rows and overwrite the object in place
    csv_output = csv_prev_content + curr_df.to_csv(path_or_buf=None, header=False)
    s3_resource.Object('bucket-name', ticker_csv_file_name).put(Body=csv_output)

Is there a way to do this with a gzip-compressed CSV? I want to read an existing .gz compressed CSV on S3 if there is one, concatenate it with the contents of the DataFrame, and then overwrite the .gz with the new combined compressed CSV directly in S3, without having to make a local copy.
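
Roughly, I imagine something along these lines, reusing the same boto3 objects as above, but I am not sure this is the right way to handle the compression (untested sketch):

    import gzip

    try:
        gz_bytes = s3_resource.Object('bucket-name', ticker_csv_file_name).get()['Body'].read()
        csv_prev_content = gzip.decompress(gz_bytes).decode('utf8')
    except Exception:
        # no compressed file exists yet
        csv_prev_content = ''

    csv_output = csv_prev_content + curr_df.to_csv(path_or_buf=None, header=False)
    s3_resource.Object('bucket-name', ticker_csv_file_name).put(Body=gzip.compress(csv_output.encode('utf8')))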

rosstripi

3 Answers

19

Here's a solution in Python 3.5.2 using Pandas 0.20.1.

The source DataFrame can be read from S3, a local CSV, or whatever.

import boto3
import gzip
import pandas as pd
from io import BytesIO, TextIOWrapper

df = pd.read_csv('s3://ramey/test.csv')

# write the DataFrame as gzip-compressed CSV into an in-memory buffer
gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

# upload the compressed bytes to S3
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object('ramey', 'new-file.csv.gz')
s3_object.put(Body=gz_buffer.getvalue())
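
To double-check the round trip, the compressed file can be read straight back; this relies on s3fs being installed, just like the pd.read_csv call at the top of this answer:

# compression is inferred from the .gz extension, but it does not hurt to be explicit
df_check = pd.read_csv('s3://ramey/new-file.csv.gz', compression='gzip')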
KT.
ramhiser
  • Works fine for me. Any ideas why the parameter 'compression' is not working in df.to_csv() when writing to a stream? – szu Jun 13 '18 at 20:47
  • @szu because pandas.io.formats.format.CSVFormatter is implemented this way. I presume the idea is that if you are providing a stream, you might as well set up the processing as necessary yourself... or the developers simply did not care to provide and test a "sensible" implementation there and no one cared to fix this so far (the sad thing is that you cannot easily fix it now without potentially breaking other people's code). – KT. Nov 01 '18 at 19:20
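
A side note on that comment thread: the compression argument is only skipped when to_csv is handed an already-open stream; give it a plain path and it does compress. In reasonably recent pandas versions (around 1.2+, if I remember correctly, and with s3fs installed) this reportedly works for s3:// URLs as well, which avoids the buffer dance entirely. The bucket and key below are made up:

import pandas as pd

# compression can also be inferred from the .gz suffix
df.to_csv('s3://my-bucket/my-file.csv.gz', index=False, compression='gzip')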
2

There is a more elegant solution using smart-open (https://pypi.org/project/smart-open/):

import pandas as pd
from smart_open import open

# smart_open compresses on the fly based on the .gz extension
with open('s3://bucket/prefix/filename.csv.gz', 'w') as f:
    df.to_csv(f, index=False)
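
Because smart_open also decompresses transparently on read (again keyed off the .gz extension), the read-append-overwrite flow from the question could be sketched like this; the key name is a placeholder and curr_df is the DataFrame of new rows:

from smart_open import open

key = 's3://bucket/prefix/filename.csv.gz'

try:
    with open(key, 'r') as f:   # decompressed on the fly
        prev_content = f.read()
except Exception:               # e.g. the object does not exist yet
    prev_content = ''

with open(key, 'w') as f:       # recompressed on the fly
    f.write(prev_content + curr_df.to_csv(header=False))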
-1

If you want streaming writes (so that the full compressed or uncompressed CSV is never held in memory at once), you can do this:

import s3fs
import io
import gzip

def write_df_to_s3(df, filename, path):
    s3 = s3fs.S3FileSystem(anon=False)
    with s3.open(path, 'wb') as f:
        gz = gzip.GzipFile(filename, mode='wb', compresslevel=9, fileobj=f)
        buf = io.TextIOWrapper(gz, encoding='utf-8')
        df.to_csv(buf, index=False)
        buf.flush()  # push any text still buffered in the wrapper into the gzip stream
        gz.close()   # finish the gzip stream before the S3 file handle closes

TextIOWrapper is needed until this issue is fixed: https://github.com/pandas-dev/pandas/issues/19827
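
For completeness, a minimal way to call it (bucket and key are made up; anon=False means your normal AWS credentials are picked up):

import pandas as pd

df = pd.DataFrame({'ticker': ['AAPL'], 'close': [123.45]})
write_df_to_s3(df, 'my-file.csv.gz', 's3://my-bucket/some/prefix/my-file.csv.gz')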

user582175
  • This answer appears to work; however, it would be improved by showing how to call it. – Sidney Mar 12 '20 at 17:15
  • In addition, looking at the docs for gzip.GzipFile, it says either filename or fileobj need to be set to a non-trivial value (which I assume means not None). Could the filename parameter of the function be removed, and the call to GzipFile pass None in its place, since fileobj is set to the file handle from s3.open? – Sidney Mar 12 '20 at 17:16
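
On that last point: since fileobj is supplied, GzipFile's filename argument is indeed optional; it only fills in the original-filename field of the gzip header. So, as far as I can tell, the call could simply be written without it:

gz = gzip.GzipFile(mode='wb', compresslevel=9, fileobj=f)  # no filename needed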