
Problem

I'm trying to download more than 100,000 files from an FTP server in parallel (using threads). I previously tried urlretrieve, as answered here, but that gave me the following error: URLError(OSError(24, 'Too many open files')). Apparently this problem is a bug (I cannot find the reference anymore), so I tried using urlopen in combination with shutil and then writing to a file object that I could close myself, as described here. This seemed to work fine, but then I got the same error again: URLError(OSError(24, 'Too many open files')). I thought that whenever writing to a file is incomplete or fails, the with statement would cause the file to close itself, but apparently the files stay open and eventually cause the script to halt.

Question

How can I prevent this error, i.e. make sure that every file gets closed?

Code

import csv
import urllib.request
import shutil
from multiprocessing.dummy import Pool

def url_to_filename(url):
    filename = 'patric_genomes/' + url.split('/')[-1]
    return filename

def download(url):
    url = url.strip()
    try:
        with urllib.request.urlopen(url) as response, open(url_to_filename(url), 'wb') as out_file:
            shutil.copyfileobj(response, out_file)
    except Exception as e:
        return None, e

def build_urls(id_list):
    base_url = 'ftp://some_ftp_server/'
    urls = []
    for some_id in id_list:
        url = base_url + some_id + '/' + some_id + '.fna'
        print(url)
        urls.append(url)
    return urls


if __name__ == "__main__":
    with open('full_data/genome_ids.txt') as inFile:
        reader = csv.DictReader(inFile, delimiter = '\t')
        ids = {row['some_id'] for row in reader}
        urls = build_urls(ids)
        p = Pool(100)
        print(p.map(download, urls)) 
CodeNoob
  • Possible duplicate of https://stackoverflow.com/questions/45665991/multiprocessing-returns-too-many-open-files-but-using-with-as-fixes-it-wh – match Feb 03 '18 at 13:20
  • Are you saving all the files to the same folder? – Tarun Lalwani Feb 11 '18 at 16:26
  • @CodeNoob, what does "ulimit -a" show? – Oleg Kuralenko Feb 11 '18 at 20:27
  • 1
    `100.000 files from a ftp server in parallel` **You're a bad joker.** You can download only 2 files at a time, making it easier to provide backward control. There is no difference between downloading the 100 files at the same time and downloading 2 files. Depends on the limit of system resources. Installing an irrelevant install of the system will affect the services that are bad. You can force it on the Windows system, but Linux shows you the door directly. – dsgdfg Feb 12 '18 at 07:46
  • You also need to know how much system load the 100 file is doing and the service/system limits. Make sure that "if you used the first versions of the Linux system, you wouldn't ask these questions!" **What is the definition of software development?** – dsgdfg Feb 12 '18 at 07:47

3 Answers


You may try using contextlib to make sure the response object gets closed, like so:

import contextlib
[ ... ]

with contextlib.closing(urllib.request.urlopen(url)) as response, open(url_to_filename(url), 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

[ ... ]

According to the docs:

contextlib.closing(thing)

    Return a context manager that closes thing upon completion of the block. [ ... ] without needing to explicitly close page. Even if an error occurs, page.close() will be called when the with block is exited.

Alternatively, a workaround would be to raise the open-files limit on your Linux OS. Check your current hard limit for open files:

ulimit -Hn

Then add the following line to your /etc/sysctl.conf file:

fs.file-max = <number>

where <number> is the new upper limit of open files you want to set. Save and close the file, then run:

sysctl -p

so that the changes take effect.
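
If you would rather not change the system-wide setting, a rough alternative (a sketch of mine, not part of the original answer; Unix-only, and it can only raise the soft limit up to the existing hard limit) is to bump the per-process limit from inside the script with the standard resource module:

import resource

# Current per-process limits for open file descriptors: (soft, hard)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('current limits:', soft, hard)

# Raise the soft limit as far as the existing hard limit allows;
# raising the hard limit itself requires root privileges.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

Either way this only raises the ceiling; with 100 worker threads the script can still hold roughly 200 descriptors open at once.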

AnythingIsFine
  • Although the other answers advise what I could do better (which I appreciate), I think this answer addresses the problem I had. – CodeNoob Feb 16 '18 at 10:22

I believe the file handles you create are not disposed of by the system in time, since it takes a while to close a connection. So you end up exhausting all the free file handles (and that includes network sockets) very quickly.

What you are doing is setting up a new FTP connection for each of your files. This is bad practice. A better way is to open 5-15 connections and reuse them, downloading the files through the existing sockets without the overhead of the initial FTP handshake for each file. See this post for reference.
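
As a rough illustration of that idea (a sketch only; the host name, login, and paths are placeholders rather than the question's actual server), you can keep one persistent ftplib connection per worker thread and reuse it for every download that thread performs:

import ftplib
import threading
from multiprocessing.dummy import Pool

FTP_HOST = 'some_ftp_server'   # placeholder host
_local = threading.local()     # holds one persistent FTP connection per worker thread

def get_connection():
    # Open a connection the first time this thread needs one, then reuse it.
    if getattr(_local, 'ftp', None) is None:
        ftp = ftplib.FTP(FTP_HOST)
        ftp.login()            # anonymous login; pass credentials here if needed
        _local.ftp = ftp
    return _local.ftp

def download(remote_path):
    local_path = 'patric_genomes/' + remote_path.split('/')[-1]
    ftp = get_connection()
    with open(local_path, 'wb') as out_file:
        ftp.retrbinary('RETR ' + remote_path, out_file.write)

if __name__ == '__main__':
    ids = ['id1', 'id2']       # placeholder IDs; read them from genome_ids.txt as in the question
    remote_paths = ['{0}/{0}.fna'.format(some_id) for some_id in ids]
    with Pool(10) as pool:     # 10 worker threads -> at most 10 open FTP connections
        pool.map(download, remote_paths)

This keeps the number of simultaneously open sockets equal to the number of worker threads instead of the number of files.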

P.S. Also, as @Tarun_Lalwani mentioned, it is not a good idea to create a folder with more than ~1000 files in it, as it slows down the file system.
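
If you do end up with that many files in one place, a common workaround (again just a sketch, with a hypothetical helper name) is to shard the downloads into subdirectories, for example by the first two characters of the file name:

from pathlib import Path

def sharded_path(base_dir, filename):
    # Group files into subdirectories named after the first two characters
    # of the file name, so no single directory holds an excessive number of entries.
    target_dir = Path(base_dir) / filename[:2]
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / filename

# e.g. sharded_path('patric_genomes', 'ABC123.fna') -> patric_genomes/AB/ABC123.fna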

igrinis

How can I prevent this error, i.e. make sure that every file gets closed?

To prevent the error you need to either increase the open-file limit or, more reasonably, decrease the concurrency of your thread pool. Connections and files are closed properly by the context managers.

Your thread pool has 100 threads and opens at least 200 handles (one for the FTP connection and another for the file). Reasonable concurrency would be about 10-30 threads.

Here's a simplified reproduction which shows that the code itself is okay. Put some content in somefile in the current directory.

test.py

#!/usr/bin/env python3

import sys
import shutil
import logging
from pathlib import Path
from urllib.request import urlopen
from multiprocessing.dummy import Pool as ThreadPool


def download(id):
    ftp_url = sys.argv[1]
    target_dir = Path(__file__).parent / 'files'
    target_dir.mkdir(exist_ok=True)  # make sure the output directory exists
    try:
        with urlopen(ftp_url) as src, (target_dir / id).open('wb') as dst:
            shutil.copyfileobj(src, dst)
    except Exception:
        logging.exception('Download error')


if __name__ == '__main__':
    with ThreadPool(10) as p:
        p.map(download, (str(i).zfill(4) for i in range(1000)))

And then in the same directory:

$ docker run --name=ftp-test -d -e FTP_USER=user -e FTP_PASSWORD=pass \
  -v `pwd`/somefile:/srv/dir/somefile panubo/vsftpd vsftpd /etc/vsftpd.conf
$ IP=`docker inspect --format '{{ .NetworkSettings.IPAddress }}' ftp-test`
$ curl ftp://user:pass@$IP/dir/somefile
$ python3 test.py ftp://user:pass@$IP/dir/somefile
$ docker stop ftp-test && docker rm -v ftp-test
saaj
  • Could you please elaborate on what the code does? I don't understand the code beneath "And then in the same directory". – CodeNoob Feb 15 '18 at 13:32
  • @CodeNoob it starts a one-off Docker container running an FTP server with one file, makes sure it has been set up correctly with `curl`, and then runs the Python script which downloads the file 1000 times in 10 threads. Then it cleans up the container. – saaj Feb 15 '18 at 15:06