writing to pysftp fileobject using pandas to_csv with compression doesn't actually compress

Question

I have looked at many related answers here on Stackoverflow and this question seems most related How to Transfer Pandas DataFrame to .csv on SFTP using Paramiko Library in Python?. I want to do something similar, however, I want to compress the file when I send it to the SFTP location, so I end up with a .csv.gz file essentially. The files I am working with are 15-40 MB in size uncompressed, but there are lots of them sometimes, so need to keep the fingerprint small.

I have been using code like this to move the dataframe to the destination, after pulling it from another location as a csv, doing some transformations on the data itself:

fileList = source_sftp.listdir('/Inbox/')   
dataList = []
for item in fileList: # for each file in the list...
    print(item)
    if item[-3:] == u'csv':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item)) # read the csv directly from the sftp server into a pd Dataframe
    elif item[-3:] == u'zip':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item),compression='zip')
    elif item[-3:] == u'.gz':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item),compression='gzip')
    else:
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item),compression='infer')
    dataList.append(temp) # keep each 
#... Some transformations in here on the data 
FL = [(x.replace('.csv',''))+suffix # just swap out to suffix
      for x in fileList]
locpath = '{}/some/new/dir/'.format(dest_sftp.pwd)
i = 0     
for item in dataList:
    with dest_sftp.open(locpath + FL[i], 'w') as f:
        item.to_csv(f, index=False,compression='gzip')
    i = i+1

It seems like I should be able to get this to work, but I am guessing there is something being skipped over when I use to_csv to convert the dataframe back and then compress it on the sftp fileobject. Should I be streaming this somehow, or is there solution I am missing somewhere in the documentation on pysftp or pandas?

If I can avoid saving the csv file somewhere local first, I would like to, but I don't think I should have to, right? I am able to get the file in the end to be compressed if I just save file locally with temp.to_csv('/local/path/myfile.csv.gz', compression='gzip'), and after transferring this local file to the destination it is still compressed, so I don't think it has do with the transfer, just how pandas.Dataframe.to_csv and the pysftp.Connection.open are used together.

I should probably add that I still consider myself a newbie to much of Python, but I have been working with local to sftp and sftp to local, and have not had to do much in the way of transferring (directly or indirectly) between them.

score 0 · Answer 1 · answered Jul 14 '20 at 05:23

0

Make sure you have the latest version of Pandas.

It supports the compression with a file-like object since 0.24 only:
GH21227: df.to_csv ignores compression when provided with a file handle

answered Jul 14 '20 at 05:23

Martin Prikryl

188,800
56
490
992

1

After updating to pandas 0.24.2, it seems I get the same message on the Github entry about 20 days ago by someone else. `RuntimeWarning: compression has no effect when passing file-like object as input.` I am on Python 2.7 currently, so I may just need to meander thru finding some versions to packages for my Python 3 env. and try something there that doesn't break something else. Although, this person seemed to indicate python 3.7 is having same issue on Github. I actually ran across that github issue notice before, so I assumed it was something else that we were both doing wrong. – maxzmiz Jul 21 '20 at 21:34

writing to pysftp fileobject using pandas to_csv with compression doesn't actually compress

1 Answers1