10

I am trying to download a CSV file (in-memory) from SFTP using Paramiko and import it into a pandas dataframe.

transport = paramiko.Transport((server, 22))
transport.connect(username=username, password=password)
sftp = paramiko.SFTPClient.from_transport(transport)

with open(file_name, 'wb') as fl:
    sftp.getfo(file_name, fl, callback=printTotals)
    df = pd.read_csv(fl, sep=' ')

The code below fails, telling me:

OSError: File is not open for reading

I assume that I need some kind of buffer or file like object for fl instead, since open needs a file. I am relatively new to all of this, so I would be happy it if someone could help.

Martin Prikryl
  • 188,800
  • 56
  • 490
  • 992
lukas_o
  • 3,776
  • 4
  • 34
  • 50

2 Answers2

15

A simple solution that still allows you to use progress callback is:

  • Use BytesIO file-like object to store a downloaded file to memory;

  • You have to seek file pointer back to file start after downloading it, before you start reading it.

    with io.BytesIO() as fl:
        sftp.getfo(file_name, fl, callback=printTotals)
        fl.seek(0)
        df = pd.read_csv(fl, sep=' ')
    

Though with this solution, you will end up having the file loaded to memory twice.


Better solution is to implement a custom file-like object. It will even allow you to download and parse the file at the same time.

class FileWithProgress:

    def __init__(self, fl):
        self.fl = fl
        self.size = fl.stat().st_size
        self.p = 0

    def read(self, blocksize):
        r = self.fl.read(blocksize)
        self.p += len(r)
        print(str(self.p) + " of " + str(self.size)) 
        return r

And use it like:

with sftp.open(file_name, "rb") as fl:
    fl.prefetch()
    df = pd.read_csv(FileWithProgress(fl), sep=' ') 

For the SFTPFile.prefetch call, refer to:
Reading file opened with Python Paramiko SFTPClient.open method is slow
.


If you do not need the progress monitoring, simple code like this will do:

with sftp.open(file_name, "rb") as fl:
    fl.prefetch()
    df = pd.read_csv(fl, sep=' ') 
Martin Prikryl
  • 188,800
  • 56
  • 490
  • 992
  • 1
    that's exactly what I was looking for! I have found another solution in the mean time, which is not as good as I can't have a callback for reporting back the progress (see Answer below). Thank you! – lukas_o May 28 '18 at 12:36
3

What I ended up doing was a simple version of that, unfortunately without a callback for the progress, I also needed rb for reading:

with sftp.open(file_name, 'rb') as fl:
        df = pd.read_csv(fl, sep=' ')

Anyway, Martin's answer is what I was looking for!

lukas_o
  • 3,776
  • 4
  • 34
  • 50