I can download files one at a time with:

import urllib.request

urls = ['http://foo.com/bar.gz', 'http://foobar.com/barfoo.gz', 'http://bar.com/foo.gz']

for u in urls:
    # with no filename argument, urlretrieve saves each file to a temporary location
    urllib.request.urlretrieve(u)

I could try to do it with subprocess, like this:

import subprocess
import os

def parallelized_commandline(command, files, max_processes=2):
    processes = set()
    for name in files:
        processes.add(subprocess.Popen([command, name]))
        if len(processes) >= max_processes:
            os.wait()
            processes.difference_update(
                [p for p in processes if p.poll() is not None])

    # Wait for any remaining child processes to finish
    for p in processes:
        if p.poll() is None:
            p.wait()

urls = ['http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.en.gz',
        'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.cs.gz',
        'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.de.gz']

parallelized_commandline('wget', urls)

Is there any way to parallelize urlretrieve without using os.system or subprocess to cheat?

Given that I must resort to the "cheat" for now, is subprocess.Popen the right way to download the data?

When using the parallelized_commandline() above, it seems to use multiple threads but not multiple cores for wget. Is that normal? Is there a way to make it multi-core instead of multi-threaded?

Contestosis
alvas
  • The other thing you could look at is threads. I know they can slow down some tasks, but for I/O-bound tasks they usually result in a speedup, since execution switches to other threads when the I/O starts to block. I'm not sure if it would help in this case, but you could give it a go. – James Kent Aug 03 '15 at 10:20
  • or *cheat* by launching each in `screen` ... if you don't need to sync on them being done. `subprocess.call(['screen','-S','sleepx','-dm','sleep','876543210'])` – Skaperen Aug 03 '15 at 10:33
  • Please take a look at http://stackoverflow.com/questions/18377475/asynchronously-get-and-store-images-in-python – Slam Aug 03 '15 at 10:42
  • related: [How to download a few files simultaneusly from ftp in Python](http://stackoverflow.com/q/16140921/4279) – jfs Aug 03 '15 at 19:41

1 Answer

You could use a thread pool to download files in parallel:

#!/usr/bin/env python3
from multiprocessing.dummy import Pool # use threads for I/O bound tasks
from urllib.request import urlretrieve

urls = [...]
result = Pool(4).map(urlretrieve, urls) # download 4 files at a time
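
If you need the files saved into a specific directory (urlretrieve() called with only a URL saves to a temporary location), you can pass a small wrapper to the pool instead of urlretrieve itself, as also suggested in the comments below. A minimal sketch, assuming a hypothetical fetch_url() helper and a destination_dir of your choosing:

#!/usr/bin/env python3
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve
from multiprocessing.dummy import Pool # use threads for I/O bound tasks

destination_dir = 'downloads' # hypothetical target directory
os.makedirs(destination_dir, exist_ok=True)

def fetch_url(url):
    # take the local filename from the last component of the URL path
    filename = os.path.basename(urlsplit(url).path)
    return urlretrieve(url, os.path.join(destination_dir, filename))

urls = [...]
result = Pool(4).map(fetch_url, urls) # download 4 files at a time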

You could also download several files at once in a single thread using asyncio:

#!/usr/bin/env python3
import asyncio
import logging
from contextlib import closing
import aiohttp # $ pip install aiohttp

@asyncio.coroutine
def download(url, session, semaphore, chunk_size=1<<15):
    with (yield from semaphore): # limit number of concurrent downloads
        filename = url2filename(url)
        logging.info('downloading %s', filename)
        response = yield from session.get(url)
        with closing(response), open(filename, 'wb') as file:
            while True: # save file
                chunk = yield from response.content.read(chunk_size)
                if not chunk:
                    break
                file.write(chunk)
        logging.info('done %s', filename)
    return filename, (response.status, tuple(response.headers.items()))

urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
with closing(asyncio.get_event_loop()) as loop, \
     closing(aiohttp.ClientSession()) as session:
    semaphore = asyncio.Semaphore(4)
    download_tasks = (download(url, session, semaphore) for url in urls)
    result = loop.run_until_complete(asyncio.gather(*download_tasks))

where url2filename() is defined here.
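
If the linked helper isn't handy, here is a minimal sketch of what url2filename() needs to do, assuming the local filename can simply be taken from the last component of the URL path:

import os
from urllib.parse import urlsplit

def url2filename(url):
    # 'http://example.com/a/b/news-commentary-v10.en.gz' -> 'news-commentary-v10.en.gz'
    return os.path.basename(urlsplit(url).path)

Also note that the asyncio example above uses the generator-based coroutine syntax (@asyncio.coroutine / yield from) that was current when the answer was written and has since been removed from Python. A rough equivalent using async/await, assuming Python 3.7+ and aiohttp 3.x:

#!/usr/bin/env python3
import asyncio
import logging
import os
from urllib.parse import urlsplit

import aiohttp # $ pip install aiohttp

async def download(url, session, semaphore, chunk_size=1<<15):
    async with semaphore: # limit number of concurrent downloads
        filename = os.path.basename(urlsplit(url).path)
        logging.info('downloading %s', filename)
        async with session.get(url) as response:
            with open(filename, 'wb') as file:
                async for chunk in response.content.iter_chunked(chunk_size):
                    file.write(chunk) # save file chunk by chunk
        logging.info('done %s', filename)
        return filename, response.status

async def main(urls):
    semaphore = asyncio.Semaphore(4)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(download(url, session, semaphore) for url in urls))

urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
result = asyncio.run(main(urls))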

jfs
  • How to download images to specific directory? – neel Aug 26 '17 at 12:26
  • @neel do you see filename in the code? Replace it with `os.path.join(destination_directory, filename)`. If it is unclear; ask a separate Stack Overflow question. – jfs Aug 26 '17 at 12:42
  • Sorry if it was not clear, I was asking in the Pool. – neel Aug 26 '17 at 12:46
  • @neel define your own `fetch_url()`, [example](https://stackoverflow.com/a/27986480/4279) and pass it to the pool instead of `urlretrieve()`. You could use `urlretrieve(url, os.path.join(destination_directory, filename))` inside your fetch_url() function. – jfs Aug 26 '17 at 12:52
  • Similarly how to upload files block by block in parallel using python ? – Ashish Karpe Sep 15 '17 at 01:52
  • @AshishKarpe if you haven't found an existing question about it, ask a new Stack Overflow question (this question is about downloading) – jfs Sep 15 '17 at 05:04
  • @jfs Hi, how can I get the file data (binary, I think?) from the list of results? result is a list of tuples, and inside the tuple I cannot find the file data (binary or base64 or something like that). – Ardi Nusawan Jul 13 '18 at 08:18
  • *Edit: never mind, I got it now :) The files are simply saved in the folder where the script is running. – Ardi Nusawan Jul 13 '18 at 08:40
  • Is it possible to download using asyncio and wget? I tried the above answer but downloads are incomplete for many urls. It could be a source site issue, but when I tried looping with wget, I got complete files. – Vinay Apr 08 '20 at 16:02