I am trying to read a file from an FTP server. The file is a .gz file. I would like to know if I can perform actions on this file while the socket is open. I tried to follow what was mentioned in two StackOverflow questions on reading files without writing to disk and reading files from FTP without downloading but was not successful.

I know how to extract data/work on the downloaded file but I'm not sure if I can do it on the fly. Is there a way to connect to the site, get data in a buffer, possibly do some data extraction and exit?

When I tried it with StringIO, I got this error as soon as I connected:

>>> from ftplib import FTP
>>> from StringIO import StringIO
>>> ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
  File "C:\Python27\lib\ftplib.py", line 117, in __init__
    self.connect(host)
  File "C:\Python27\lib\ftplib.py", line 132, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout)
  File "C:\Python27\lib\socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11004] getaddrinfo failed

I just need to know how I can get the data into a variable and loop over it until the whole file has been read from the FTP server.

I appreciate your time and help. Thanks!

smandape
  • Do you need to read the file into a local buffer (like read()) or to manipulate it remotely using FTP commands? – Stefano Sanfilippo Sep 12 '13 at 19:27
  • I want to manipulate it remotely using FTP. Correct me if I am wrong, but if I read it into local buffer would that mean downloading the file? – smandape Sep 12 '13 at 19:28
  • I mean, you want to transfer data from the FTP server to your PC and then use that, is this right? (that's what happens in the SO question you linked) – Stefano Sanfilippo Sep 12 '13 at 19:32
  • I am sorry for the confusion but I don't want to transfer data from server on my PC. – smandape Sep 12 '13 at 19:33
  • So, do you want to process data on the server and then transfer results on your PC? Or what? Please clarify. – Stefano Sanfilippo Sep 12 '13 at 19:35
  • Exactly, that is what I am looking for. Go to the server take that file and process it right there and get the results back on my PC. – smandape Sep 12 '13 at 19:35
  • So, for instance, you want to extract the third column from each row of the CSV and then retrieve only a list of "third columns"? – Stefano Sanfilippo Sep 12 '13 at 19:36

3 Answers

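Pass only the host name to FTP(), not a full ftp:// URL (that is what made getaddrinfo fail), and make sure to log in to the FTP server first. After this, use retrbinary, which pulls the file in binary mode and calls a callback on each chunk of the file. You can use this to collect the whole file into a single byte string in memory.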

from ftplib import FTP

ftp = FTP('ftp.ncbi.nlm.nih.gov')  # host name only, no ftp:// prefix or path
ftp.login()  # Username: anonymous password: anonymous@

# Set up a cheap way to catch the data (could use io.BytesIO too)
data = []
def handle_binary(more_data):
    # Called once for each chunk received from the server
    data.append(more_data)

resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
data = b"".join(data)  # chunks are bytes, so join them with a bytes separator
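The `getaddrinfo failed` error in the question comes from handing a whole `ftp://` URL to `FTP()`, which expects only a host name. As a minimal sketch (the helper name `split_ftp_url` is made up for illustration and is not part of ftplib; the `try`/`except` import is there so it runs on both Python 2 and 3), a URL can be split into the two pieces that `FTP()` and `RETR` need:

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2


def split_ftp_url(url):
    """Return (host, path) for an ftp:// URL. Hypothetical helper."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        raise ValueError("expected an ftp:// URL, got %r" % url)
    # hostname excludes any user:password@ prefix and any :port suffix
    return parts.hostname, parts.path.lstrip("/")


host, path = split_ftp_url("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz")
# host is what FTP(host) expects; path is what goes after "RETR "
```

With these two values, the connection would be opened with `FTP(host)` and the file requested with `"RETR " + path`.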

Bonus points: how about we decompress the string while we're at it?

Easy mode, using the data string above:

import gzip
import io

# Wrap the compressed bytes in a file-like object so gzip can read them
zippy = gzip.GzipFile(fileobj=io.BytesIO(data))
uncompressed_data = zippy.read()

Little bit better, full solution:

from ftplib import FTP
import gzip
import io

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@

# In-memory binary buffer (io.BytesIO works on both Python 2 and 3)
sio = io.BytesIO()
def handle_binary(more_data):
    sio.write(more_data)

resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
sio.seek(0) # Go back to the start
zippy = gzip.GzipFile(fileobj=sio)

uncompressed = zippy.read()  # the decompressed CSV, as bytes
ftp.quit()

In reality, it would be much better to decompress on the fly, but I don't see a way to do that with the built-in libraries (at least not easily).
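One possible way to decompress as the chunks arrive, which is not part of the original answer, is the standard library's `zlib.decompressobj` with `wbits=16 + zlib.MAX_WBITS`, which makes zlib expect a gzip header and trailer. The sketch below assumes the file is a single gzip member; `make_gunzip_callback` and `sink` are names invented for this example.

```python
import zlib


def make_gunzip_callback(sink):
    """Return (callback, finish); both pass decompressed bytes to sink."""
    # 16 + MAX_WBITS tells zlib to expect gzip framing rather than raw zlib
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)

    def callback(chunk):
        # Decompress whatever this chunk allows and hand it on immediately
        out = decomp.decompress(chunk)
        if out:
            sink(out)

    def finish():
        # Emit anything still buffered once the transfer is complete
        out = decomp.flush()
        if out:
            sink(out)

    return callback, finish
```

With the `ftp` object from the code above, this could be used as `callback, finish = make_gunzip_callback(pieces.append)`, then `ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=callback)`, then `finish()`, where `pieces` is a plain list. Any function that accepts bytes can serve as `sink`, so the data can be processed without ever holding the whole compressed file in memory.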

Kyle Kelley
  • Thanks for the answer. I got a quick question, does this download the data on my computer or not? If not where it holds the data? – smandape Sep 12 '13 at 20:16
  • It holds it in memory, within a string named data (or uncompressed if you go the whole way). – Kyle Kelley Sep 12 '13 at 20:17
  • So, the final variable that holds the data would be uncompressed, right? – smandape Sep 12 '13 at 20:22
  • Yeah, with the bottom section of code, the data would be uncompressed. Bam! You now have a csv file you can parse with the built in csv library or the amazing pandas library. – Kyle Kelley Sep 12 '13 at 20:23
  • No problem. I have the code in a notebook too: http://nbviewer.ipython.org/url/nb.fict.io/7badff02-0444-4dfc-ad69-be349a8c8400 – Kyle Kelley Sep 12 '13 at 20:26
  • @smandape please note that you are not processing data remotely as you had asked... – Stefano Sanfilippo Sep 13 '13 at 20:11
  • I'm not sure why, but one has to replace StringIO with BytesIO to get this snippet working with Python 3.4 – tagoma Feb 10 '15 at 14:22
  • Using BytesIO works fine for extracting 4 byte reals for me with python 3.4. Nice work Kyle – captain_M Apr 22 '16 at 19:32
  • You actually do not need the `handle_binary` function. Just use `callback=data.append` or `callback=sio.write`, respectively. – Martin Prikryl Apr 15 '19 at 06:07
  • `data = "".join(data)` => `TypeError: sequence item 0: expected str instance, bytes found` – juanmf May 24 '20 at 22:56

There are two easy ways I can think of to download a file using FTP and store it locally:

  1. Using ftplib:

    from ftplib import FTP
    
    ftp = FTP('ftp.ncbi.nlm.nih.gov')
    ftp.login()
    ftp.cwd('pub/pmc')
    # The file is opened once; its write method is called for each chunk
    with open('PMC-ids.csv.gz', 'wb') as f:
        ftp.retrbinary('RETR PMC-ids.csv.gz', f.write)
    ftp.quit()
    
  2. Using urllib:

    # Python 3; on Python 2 use: from urllib import urlretrieve
    from urllib.request import urlretrieve
    
    urlretrieve("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz", "PMC-ids.csv.gz")
    

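If you don't want to download and store it in a file, but want to process it gradually as it arrives, I suggest using urllib2 (named urllib.request in Python 3):

# Python 3; on Python 2 use: from urllib2 import urlopen
from urllib.request import urlopen

u = urlopen("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/readme.txt")

for line in u:
    # Each line arrives as bytes; decode it before printing
    print(line.decode('utf-8', errors='replace').rstrip())

which prints your file line by line.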
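The same gradual approach could be applied to the compressed CSV from the question. The following is a sketch for Python 3 only, assuming that the object returned by `urlopen` can be read by `gzip.GzipFile` through its `read()` method and that the CSV is UTF-8 encoded; neither assumption is confirmed by the thread, and `iter_gzip_csv_rows` is a name made up here.

```python
import csv
import gzip
import io


def iter_gzip_csv_rows(stream, encoding="utf-8"):
    """Yield CSV rows from a gzip-compressed binary stream, read lazily."""
    gz = gzip.GzipFile(fileobj=stream, mode="rb")
    # newline="" lets the csv module handle line endings itself
    with io.TextIOWrapper(gz, encoding=encoding, newline="") as text:
        for row in csv.reader(text):
            yield row
```

A call such as `for row in iter_gzip_csv_rows(urlopen("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz")): print(row[2])` would then print the third column of each row as the data comes in. The bytes still travel from the server to your machine; they are just never written to disk or held in memory all at once.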

nickie
  • I could be wrong, but in option 1, wouldn't it overwrite the file with the next chunk if reading the binary takes more than one chunk? shouldn't the open be set as `'ab'` rather than `'wb'` – Tom Busby Jan 08 '15 at 21:19
  • @TomBusby, no, `'wb'` is just fine. Parameter passing in Python is eager (call-by-value). The callback passed to the `retrbinary` method is just the second parameter. It is eagerly computed, therefore `open(..., 'wb')` is evaluated just once and the `write` method of the returned file object is the callback that is passed to `retrbinary`. The file is opened just once for writing, not each time the callback is called, as you may have thought. – nickie Jan 21 '15 at 19:31

That is not possible. To process data on the server, you need to have some sort of execution permissions, be it for a shell script you would send or SQL access.

FTP is pure file transfer, with no execution allowed. You will need to either enable SSH access, load the data into a database and access it with queries, or download the file with urllib and then process it locally, like this:

# Python 3; on Python 2 use urllib.urlopen instead
from urllib.request import urlopen
handle = urlopen('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
# Use data, maybe: buffer = handle.read()

In particular, I think the third one is the only zero-effort solution.

Stefano Sanfilippo
  • On second thought and on second careful reading of the comments exchanged between Kyle and Stefano, right below the question, I apologise for having downvoted this answer. However, it seems that what Kyle wanted to ask was not what he actually asked. If you read Stefano's answer as a reply to the original question, it doesn't seem to be true. In any case, if Stefano clarifies what he answered to (and edits the answer, to let me take back my negative vote), I'll be glad to make amends. – nickie Oct 03 '14 at 19:03