I have looked at many related answers here on Stackoverflow and this question seems most related How to Transfer Pandas DataFrame to .csv on SFTP using Paramiko Library in Python?. I want to do something similar, however, I want to compress the file when I send it to the SFTP location, so I end up with a .csv.gz file essentially. The files I am working with are 15-40 MB in size uncompressed, but there are lots of them sometimes, so need to keep the fingerprint small.
I have been using code like this to move the dataframe to the destination, after pulling it from another location as a csv, doing some transformations on the data itself:
fileList = source_sftp.listdir('/Inbox/')
dataList = []
for item in fileList: # for each file in the list...
print(item)
if item[-3:] == u'csv':
temp = pd.read_csv(source_sftp.open('/Inbox/'+item)) # read the csv directly from the sftp server into a pd Dataframe
elif item[-3:] == u'zip':
temp = pd.read_csv(source_sftp.open('/Inbox/'+item),compression='zip')
elif item[-3:] == u'.gz':
temp = pd.read_csv(source_sftp.open('/Inbox/'+item),compression='gzip')
else:
temp = pd.read_csv(source_sftp.open('/Inbox/'+item),compression='infer')
dataList.append(temp) # keep each
#... Some transformations in here on the data
FL = [(x.replace('.csv',''))+suffix # just swap out to suffix
for x in fileList]
locpath = '{}/some/new/dir/'.format(dest_sftp.pwd)
i = 0
for item in dataList:
with dest_sftp.open(locpath + FL[i], 'w') as f:
item.to_csv(f, index=False,compression='gzip')
i = i+1
It seems like I should be able to get this to work, but I am guessing there is something being skipped over when I use to_csv to convert the dataframe back and then compress it on the sftp fileobject. Should I be streaming this somehow, or is there solution I am missing somewhere in the documentation on pysftp or pandas?
If I can avoid saving the csv file somewhere local first, I would like to, but I don't think I should have to, right? I am able to get the file in the end to be compressed if I just save file locally with temp.to_csv('/local/path/myfile.csv.gz', compression='gzip')
, and after transferring this local file to the destination it is still compressed, so I don't think it has do with the transfer, just how pandas.Dataframe.to_csv and the pysftp.Connection.open are used together.
I should probably add that I still consider myself a newbie to much of Python, but I have been working with local to sftp and sftp to local, and have not had to do much in the way of transferring (directly or indirectly) between them.