27

I would like to download a file over HTTP using urllib3. I have managed to do this with the following code:

 url = 'http://url_to_a_file'
 connection_pool = urllib3.PoolManager()
 resp = connection_pool.request('GET',url )
 f = open(filename, 'wb')
 f.write(resp.data)
 f.close()
 resp.release_conn()

But I was wondering what the proper way of doing this is. For example, will it work well for big files, and if not, what should I do to make this code more fault-tolerant and scalable?

Note: It is important to me to use the urllib3 library rather than, for example, urllib2, because I want my code to be thread safe.

running.t

3 Answers

41

Your code snippet is close. Two things worth noting:

  1. If you're using resp.data, it will consume the entire response and return the connection to the pool (you don't need to call resp.release_conn() manually). This is fine if you're OK with holding the whole body in memory.

  2. You could use resp.read(amt) which will stream the response, but the connection will need to be returned via resp.release_conn().

This would look something like...

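import urllib3

chunk_size = 65536  # 64 KiB per read; url and path are assumed to be defined

http = urllib3.PoolManager()
# preload_content=False leaves the body unread so it can be streamed
r = http.request('GET', url, preload_content=False)

with open(path, 'wb') as out:
    while True:
        data = r.read(chunk_size)
        if not data:  # empty bytes means the body is exhausted
            break
        out.write(data)

# Return the connection to the pool so it can be reused
r.release_conn()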

The documentation might be a bit lacking on this scenario. If anyone is interested in making a pull-request to improve the urllib3 documentation, that would be greatly appreciated. :)
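
The same streaming loop can also be written with the response's `stream()` generator, which yields chunks until the body is exhausted. The following is only a sketch: the helper names `save_response` and `download` and the 64 KiB default chunk size are assumptions made for illustration, and the `try`/`finally` is one way to make sure the connection goes back to the pool even if writing fails.

```python
import urllib3


def save_response(resp, path, chunk_size=65536):
    """Stream a non-preloaded urllib3 response to `path`.

    Returns the number of bytes written. `save_response` is a
    hypothetical helper name, not part of urllib3.
    """
    written = 0
    try:
        with open(path, 'wb') as out:
            # stream() yields successive chunks of up to chunk_size bytes
            for chunk in resp.stream(chunk_size):
                out.write(chunk)
                written += len(chunk)
    finally:
        # Hand the connection back to the pool even if an error occurred
        resp.release_conn()
    return written


def download(url, path, chunk_size=65536):
    """Fetch `url` with GET and stream the body to `path` (hypothetical helper)."""
    http = urllib3.PoolManager()
    resp = http.request('GET', url, preload_content=False)
    return save_response(resp, path, chunk_size)
```

Calling `download(url, path)` would then replace the explicit `while` loop. Memory use stays bounded by the chunk size rather than by the size of the file.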

shazow
  • And one more question. Will it work with POST method if I add `r = http.request('POST', url)`? – running.t Jun 24 '13 at 22:40
  • @running.t Err, that was a mistake in my code. You're right, the method should go first, and your snippet will work. (Updated my answer.) – shazow Jun 25 '13 at 22:03
  • I tried the above code today using urllib3 1.15.1. It needs two modifications to be 100% correct. First, you need `preload_content=False` in `http.request('GET', url, ...)`. Second, `if data is None` should be `if not data` to take into account that `data` being an empty string, not `None`. Otherwise, it works perfectly. Thank you. I also want to thank @Alecz below for providing more clues. – Nick Lee Jun 02 '16 at 09:41
  • Works well! What's a reasonable chunk size? – Andrew Feather May 19 '17 at 16:39
  • 2
    Good question. 64kb is probably a safe choice (2**16 or 65536). – shazow May 29 '17 at 16:50
  • 2
    Is there a reason for `while` looping when `for data in request.read(chunk_size)\n\tout.write(data)` *seems* to achieve the same results? – S0AndS0 Feb 25 '19 at 05:15
9

The most correct way to do this is probably to get a file-like object that represents the HTTP response and copy it to a real file using shutil.copyfileobj as below:

import shutil
import urllib3

url = 'http://url_to_a_file'
c = urllib3.PoolManager()

with c.request('GET', url, preload_content=False) as resp, open(filename, 'wb') as out_file:
    shutil.copyfileobj(resp, out_file)

resp.release_conn()     # not 100% sure this is required though
Alecz
  • 2
    Doing `resp.release_conn()` with `preload_content=False` is required so that the connection can be reused by the pool manager. See [Streaming and IO](https://urllib3.readthedocs.io/en/latest/advanced-usage.html#streaming-and-io). – Uyghur Lives Matter Aug 19 '19 at 19:31
  • According to documentation `resp.release_conn()` seems not required. This is the description of the release_conn parameter: _If False, then the urlopen call will not release the connection back into the pool once a response is received (**but will release if you read the entire contents of the response** such as when preload_content=True)._ – Ernesto Dec 30 '19 at 17:19
3

The easiest way with urllib3 is to use shutil, which handles the chunked copying for you.

import urllib3
import shutil

http = urllib3.PoolManager()
with open(filename, 'wb') as out:
    r = http.request('GET', url, preload_content=False)
    shutil.copyfileobj(r, out)
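
To make downloads more tolerant of transient failures, one option is to configure retries and timeouts on the pool itself. The following is only a sketch: the helper name `make_pool` and the specific retry count, backoff factor, status codes, and timeout values are illustrative assumptions, not values recommended anywhere in this thread.

```python
import urllib3
from urllib3.util import Retry, Timeout


def make_pool():
    """Build a PoolManager with retries and timeouts (hypothetical helper)."""
    # Retry up to 3 times, with increasing sleeps between attempts,
    # and also retry on these server-side status codes (assumed values).
    retries = Retry(total=3, backoff_factor=0.5,
                    status_forcelist=[500, 502, 503, 504])
    # Fail a connect attempt after 5 s and a stalled read after 30 s (assumed values).
    timeout = Timeout(connect=5.0, read=30.0)
    return urllib3.PoolManager(retries=retries, timeout=timeout)
```

A pool built this way can be used in place of `http` above, for example `make_pool().request('GET', url, preload_content=False)`. A single PoolManager can also be shared between threads, which matches the thread-safety requirement in the question.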
Marco Lampis