
I am retrieving data files from an FTP server in a loop with the following code:

    response = urllib.request.urlopen(url)
    data = response.read()
    response.close()
    compressed_file = io.BytesIO(data)
    gin = gzip.GzipFile(fileobj=compressed_file)

Retrieving and processing the first few files works fine, but after a few requests I get the following error:

    530 Maximum number of connections exceeded.

I tried closing the connection (see the code above) and adding a sleep() timer, but neither worked. What am I doing wrong here?
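For reference, the surrounding loop looks roughly like this (simplified; the list of URLs and the sleep interval are placeholders, not my exact script):

    import io
    import gzip
    import time
    import urllib.request

    for url in file_urls:                # placeholder: list of ftp:// file URLs
        response = urllib.request.urlopen(url)
        data = response.read()
        response.close()                 # explicit close after each request
        compressed_file = io.BytesIO(data)
        gin = gzip.GzipFile(fileobj=compressed_file)
        # ... process gin ...
        time.sleep(10)                   # pause between requests; interval is a guess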

Gilbert
  • I did just try contextlib.closing() as suggested in http://stackoverflow.com/questions/1522636/should-i-call-close-after-urllib-urlopen by @Alex Martelli. This did not work. – Gilbert May 27 '16 at 22:32
  • How many connections are you actually opening? Do you really need that many? – tripleee May 27 '16 at 22:50
  • Does the server impose any limits on the amount of connections you can make in a specific time interval? – noisypixy May 27 '16 at 23:13
  • I don't know how many connections are allowed. This is the server: ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2014/ Could I reuse a connection? – Gilbert May 28 '16 at 05:46

2 Answers

2

Trying to make urllib do FTP properly makes my brain hurt. By default, it creates a new connection for each file, apparently without really ensuring the connections are closed. I think ftplib is more appropriate here.

Since I happen to be working on the same data you are (or were)... here is a very specific answer that decompresses the .gz files and passes them into ish_parser (https://github.com/haydenth/ish_parser). I think it is also clear enough to serve as a general answer.

import ftplib
import io
import gzip
import ish_parser # from: https://github.com/haydenth/ish_parser

ftp_host = "ftp.ncdc.noaa.gov"
parser = ish_parser.ish_parser()

# identifies what data to get
USAF_ID = '722950'
WBAN_ID = '23174'
YEARS = range(1975, 1980)

with ftplib.FTP(host=ftp_host) as ftpconn:
    ftpconn.login()

    for year in YEARS:
        ftp_file = "pub/data/noaa/{YEAR}/{USAF}-{WBAN}-{YEAR}.gz".format(USAF=USAF_ID, WBAN=WBAN_ID, YEAR=year)
        print(ftp_file)

        # read the whole file and save it to a BytesIO (stream)
        response = io.BytesIO()
        try:
            ftpconn.retrbinary('RETR '+ftp_file, response.write)
        except ftplib.error_perm as err:
            if str(err).startswith('550 '):
                print('ERROR:', err)
                continue  # file not found on the server, skip this year
            else:
                raise

        # decompress and parse each line 
        response.seek(0) # jump back to the beginning of the stream
        with gzip.open(response, mode='rb') as gzstream:
            for line in gzstream:
                parser.loads(line.decode('latin-1'))

This does read the whole file into memory, which could probably be avoided with some clever wrappers and/or yield, but it works fine for a year's worth of hourly weather observations.
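If you do want to avoid buffering the whole compressed file, one option is to decompress each chunk as retrbinary hands it over, using zlib.decompressobj. This is only a rough sketch (the helper name is mine, and it assumes ish_parser is happy receiving one line at a time, as above):

import ftplib
import zlib

def parse_gzip_stream(ftpconn, ftp_file, parser):
    # Hypothetical helper (not part of ish_parser): decompress each downloaded
    # chunk on the fly instead of keeping the whole .gz file in memory.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS: expect a gzip header
    buffer = b''

    def on_chunk(chunk):
        nonlocal buffer
        buffer += decomp.decompress(chunk)
        while b'\n' in buffer:
            line, buffer = buffer.split(b'\n', 1)
            parser.loads(line.decode('latin-1'))

    ftpconn.retrbinary('RETR ' + ftp_file, on_chunk)
    buffer += decomp.flush()
    if buffer:  # the last line may lack a trailing newline
        parser.loads(buffer.decode('latin-1'))

Whether that is worth the extra bookkeeping over the simple BytesIO version is debatable for files this small.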

travc
0

Probably a pretty nasty workaround, but this worked for me. I made a script (here called test.py) which performs the request (see the code in the question above). The code below is used in the loop I mentioned and calls test.py:

from subprocess import call

with open('log.txt', 'a') as f:
    call(['python', 'test.py', args[0], args[1]], stdout=f)
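
For completeness, test.py is essentially the snippet from the question wrapped in a small command-line script. A rough sketch follows; how the two arguments map to the file name is an assumption on my part, based on the naming pattern in the other answer:

# test.py -- hypothetical sketch; the real script is just the request code
# from the question. The two CLI arguments are assumed to be the USAF/WBAN ids.
import sys
import io
import gzip
import urllib.request

url = 'ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2014/{0}-{1}-2014.gz'.format(sys.argv[1], sys.argv[2])

response = urllib.request.urlopen(url)
data = response.read()
response.close()

with gzip.GzipFile(fileobj=io.BytesIO(data)) as gin:
    for line in gin:
        print(line.decode('latin-1').rstrip())

Because the interpreter exits after every file, the operating system tears down the FTP connection each time, which is presumably why this sidesteps the 530 error.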

Gilbert