1119

I have a small utility that I use to download an MP3 file from a website on a schedule and then build/update a podcast XML file, which I've added to iTunes.

The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3 file. I would prefer to have the entire utility written in Python.

I struggled to find a way to actually download the file in Python, which is why I resorted to using wget.

So, how do I download the file using Python?

Owen
  • See also: [How to save an image locally using Python whose URL address I already know?](http://stackoverflow.com/q/8286352/562769) – Martin Thoma Mar 14 '16 at 11:24
  • 2
    Many of the answers below are not a satisfactory replacement for `wget`. Among other things, `wget` (1) preserves timestamps (2) auto-determines filename from url, appending `.1` (etc.) if the file already exists (3) has many other options, some of which you may have put in your `.wgetrc`. If you want any of those, you have to implement them yourself in Python, but it's simpler to just invoke `wget` from Python. – ShreevatsaR Sep 27 '16 at 17:22
  • 3
    Short solution for Python 3: `import urllib.request; s = urllib.request.urlopen('http://example.com/').read().decode()` – Basj Nov 26 '19 at 09:47
  • wget is still a better approach if you need to automatically retrieve the filename and timestamps and handle duplicate files, as https://stackoverflow.com/users/4958/shreevatsar stated. If the URLs are variables, you can still handle this in Python using subprocess. – Tendai Mar 02 '23 at 08:19

30 Answers

1344

One more, using urlretrieve:

import urllib.request
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

(for Python 2 use import urllib and urllib.urlretrieve)
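
As the comments below point out, this needs a little care in practice: urlretrieve raises urllib.error.URLError on network failures and will happily overwrite an existing file. A minimal Python 3 sketch guarding against both (the URL is just the placeholder from above):

import os
import urllib.error
import urllib.request

url = "http://www.example.com/songs/mp3.mp3"
file_name = url.split("/")[-1]

if not os.path.isfile(file_name):  # avoid clobbering an existing download
    try:
        urllib.request.urlretrieve(url, file_name)
    except urllib.error.URLError as e:
        print("Download failed:", e)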

PabloG
  • 1
    Oddly enough, this worked for me on Windows when the urllib2 method wouldn't. The urllib2 method worked on Mac, though. – InFreefall May 15 '11 at 21:49
  • 7
    Bug: file_size_dl += block_sz should be += len(buffer) since the last read is often not a full block_sz. Also on windows you need to open the output file as "wb" if it isn't a text file. – Eggplant Jeff May 25 '11 at 17:53
  • 1
    Me too urllib and urllib2 didn't work but urlretrieve worked well, was getting frustrated - thanks :) – funk-shun Jul 12 '11 at 06:08
  • 5
    Wrap the whole thing (except the definition of file_name) with `if not os.path.isfile(file_name):` to avoid overwriting podcasts! useful when running it as a cronjob with the urls found in a .html file – Sriram Murali May 01 '12 at 20:15
  • I have a suggestion, using .format() instead of % string formatting and sys.stdout.write(): https://gist.github.com/3176958 – Savvas Radevic Jul 25 '12 at 16:06
  • for people who don't know, the second parameter of `urllib.urlretrieve` is the file it will be downloaded to. – QxQ Oct 16 '12 at 22:49
  • This is a good example of how to start doing stream processing on a http response. I was looking for a way to do exactly that, and could not find it right away by looking at the docs. Thanks saved me some trouble. – lcornea Nov 22 '13 at 19:05
  • Best Answer to this question. Is it possible to change block_size? – Arijoon Jun 08 '14 at 17:50
  • How did you came up with block_sz = 8192 ? is there an appropriate way to pick buffer size ? – Ciasto piekarz Jun 14 '14 at 04:33
  • Hey @PabloG ! i am having some problem in your 1st code. It can download the html file of page but not its css and js file. Please tell me how to do that. Anybody having the solution can reply. – Rahul Satal Jul 02 '15 at 12:53
  • @PabloG is it possible to download https contents? – R__raki__ Oct 02 '16 at 14:38
  • meta is a http.client.HTTPMessage object and it has no getheaders() method. Instead, change the line to: `file_size = int(dict(meta.items())['Content-Length'])` and it should work gloriously. – Harshith Thota Nov 19 '17 at 18:00
  • file_name = url.split('/')[-1] gives a URL-encoded file name, which is a problem if the file name has special characters (space, + etc.). Using urllib.parse.unquote gives the correct file name: from urllib.parse import unquote; file_name = unquote(file_name) – jai.maruthi Jul 20 '20 at 02:37
  • 10
    According to the documentation, `urllib.request.urlretrieve` is a "legacy interface" and "might become deprecated in the future. https://docs.python.org/3/library/urllib.request.html#legacy-interface – Louis Yang Dec 24 '20 at 22:21
  • Exception handling is necessary: try for except `urllib.error.URLError` – Devymex Jun 17 '21 at 01:32
  • Does `urlretrieve` silently overwrites the local file if it exists? The documentation doesn't mention this. – Qin Heyang Mar 18 '22 at 18:26
556

Use urllib.request.urlopen():

import urllib.request
with urllib.request.urlopen('http://www.example.com/') as f:
    html = f.read().decode('utf-8')

This is the most basic way to use the library, minus any error handling. You can also do more complex stuff such as changing headers.
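
For example, wrapping the URL in a Request object lets you send custom headers, and the bytes can be written straight to a file. A minimal sketch (the URL and User-Agent value are placeholders):

import urllib.request

req = urllib.request.Request(
    "http://www.example.com/songs/mp3.mp3",
    headers={"User-Agent": "Mozilla/5.0"},  # some servers reject the default Python user agent
)
with urllib.request.urlopen(req) as response, open("mp3.mp3", "wb") as out_file:
    out_file.write(response.read())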

On Python 2, the method is in urllib2:

import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
Corey
418

In 2012, use the Python requests library:

>>> import requests
>>> 
>>> url = "http://download.thinkbroadband.com/10MB.zip"
>>> r = requests.get(url)
>>> print len(r.content)
10485760

You can run pip install requests to get it.

Requests has many advantages over the alternatives because the API is much simpler. This is especially true if you have to do authentication. urllib and urllib2 are pretty unintuitive and painful in this case.
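
For instance, HTTP basic auth is a single keyword argument. A minimal sketch (URL and credentials are placeholders):

import requests

r = requests.get("https://example.com/protected/10MB.zip", auth=("user", "password"))
r.raise_for_status()        # raise an exception on 4xx/5xx responses
with open("10MB.zip", "wb") as f:
    f.write(r.content)      # note: the whole body is held in memory here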


2015-12-30

People have expressed admiration for the progress bar. It's cool, sure. There are several off-the-shelf solutions now, including tqdm:

from tqdm import tqdm
import requests

url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)

with open("10MB", "wb") as handle:
    for data in tqdm(response.iter_content()):
        handle.write(data)

This is essentially the implementation @kvance described 30 months ago.
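
As the comments note, iter_content() defaults to very small chunks; passing a chunk_size and feeding tqdm the Content-Length makes the download faster and the progress bar meaningful. A sketch under those assumptions:

from tqdm import tqdm
import requests

url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)
total = int(response.headers.get("content-length", 0))  # 0 if the server omits the header

with open("10MB", "wb") as handle, tqdm(total=total, unit="B", unit_scale=True) as bar:
    for chunk in response.iter_content(chunk_size=8192):
        handle.write(chunk)
        bar.update(len(chunk))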

hughdbrown
  • how do I save or extract if the zip file is actually a folder with many files in it? – Abdul Muneer Jul 05 '12 at 21:13
  • 8
    How does this handle large files, does everything get stored into memory or can this be written to a file without large memory requirement? – Bibek Shrestha Dec 17 '12 at 16:05
  • 9
    It is possible to stream large files by setting stream=True in the request. You can then call iter_content() on the response to read a chunk at a time. – kvance Jul 28 '13 at 17:14
  • @kvance: I did not know that. This option became known as `stream` in v1.0.0, AFAICT. It seems to have been `prefetch` in earlier versions, but I have not dug into the source code enough. Try: `git clone git@github.com:kennethreitz/requests.git && git log -S"self.stream" --source --all` – hughdbrown Aug 13 '13 at 14:10
  • 8
    Why would a url library need to have a file unzip facility? Read the file from the url, save it and then unzip it in whatever way floats your boat. Also a zip file is not a 'folder' like it shows in windows, Its a file. – Harel Nov 15 '13 at 16:36
  • What would be the difference between `r.content` and `r.text`? – Ali Jan 01 '16 at 08:45
  • 2
    @Ali: `r.text`: For text or unicode content. Returned as unicode. `r.content`: For binary content. Returned as bytes. Read about it here: http://docs.python-requests.org/en/latest/user/quickstart/ – hughdbrown Jan 17 '16 at 18:44
  • Progress bar not look nice. – mrgloom Sep 26 '18 at 12:45
  • 6
    I think a `chunk_size` argument is desirable along with `stream=True`. The default `chunk_size` is `1`, which means, each chunk could be as small as `1` byte and so is very inefficient. – haridsv Oct 01 '18 at 10:54
  • What does 2012 mean in Python versions? – lindhe Oct 04 '18 at 09:23
  • @lindhe Are you using a python version earlier than 2.7? If not, you should be able to run requests on any version you come across. – hughdbrown Oct 04 '18 at 15:58
  • Please consider not using requests for any serious application, it's not thead-safe and has a memory-leak issue. – ospider Mar 22 '19 at 02:55
  • @ospider: Do you have more info? It might still be okay if it's a one-off solution. E.g. your script expects a json in input folder. If it isn't there, download it, and forget about requests next time you run the script. – Eric Duminil Feb 16 '23 at 07:09
  • `for data in tqdm(response.iter_content(chunk_size=1024), unit='kB'):` will download and write the file in chunk, and show the correct unit in the progress bar. – Eric Duminil Feb 16 '23 at 07:35
  • This link is quite useful. https://realpython.com/python-download-file-from-url/#downloading-a-large-file-in-a-streaming-fashion – Pankaj Yadav Aug 22 '23 at 07:55
170
import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3','wb') as output:
  output.write(mp3file.read())

The wb in open('test.mp3','wb') opens a file (and erases any existing file) in binary mode so you can save data with it instead of just text.
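
As the comments below point out, mp3file.read() loads the whole file into memory before anything is written; copying in chunks via shutil.copyfileobj avoids that. The same example, sketched that way:

import shutil
import urllib2

mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3', 'wb') as output:
    shutil.copyfileobj(mp3file, output)  # streams fixed-size chunks instead of one big read()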

Grant
  • 38
    The disadvantage of this solution is, that the entire file is loaded into ram before saved to disk, just something to keep in mind if using this for large files on a small system like a router with limited ram. – tripplet Nov 18 '12 at 13:33
  • 2
    @tripplet so how would we fix that? – Lucas Henrique Jul 30 '15 at 15:10
  • 11
    To avoid reading the whole file into memory, try passing an argument to `file.read` that is the number of bytes to read. See: https://gist.github.com/hughdbrown/c145b8385a2afa6570e2 – hughdbrown Oct 07 '15 at 16:02
  • @hughdbrown I found your script useful, but have one question: can I use the file for post-processing? suppose I download a jpg file that I want to process with OpenCV, can I use the 'data' variable to keep working? or do I have to read it again from the downloaded file? – Rodrigo E. Principe Nov 16 '16 at 12:29
  • 7
    Use `shutil.copyfileobj(mp3file, output)` instead. – Mmmh mmh Nov 06 '17 at 14:20
162

Python 3

  • urllib.request.urlopen

    import urllib.request
    response = urllib.request.urlopen('http://www.example.com/')
    html = response.read()
    
  • urllib.request.urlretrieve

    import urllib.request
    urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
    

    Note: According to the documentation, urllib.request.urlretrieve is a "legacy interface" and "might become deprecated in the future" (thanks gerrit)

Python 2

  • urllib2.urlopen (thanks Corey)

    import urllib2
    response = urllib2.urlopen('http://www.example.com/')
    html = response.read()
    
  • urllib.urlretrieve (thanks PabloG)

    import urllib
    urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
    
bmaupin
  • 3
    It sure took a while, but there, finally is the easy straightforward api I expect from a python stdlib :) – ThorSummoner Aug 04 '17 at 20:52
  • Very nice answer for python3, see also https://docs.python.org/3/library/urllib.request.html#urllib.request.urlretrieve – Edouard Thiel Dec 23 '19 at 10:31
  • @EdouardThiel If you click on `urllib.request.urlretrieve` above it'll bring you to that exact link. Cheers! – bmaupin Dec 23 '19 at 14:44
  • 3
    `urllib.request.urlretrieve` is documented as a "legacy interface" and "might become deprecated in the future". – gerrit Mar 27 '20 at 17:32
  • 1
    You should mention that you are getting a bunch of bytes that need to be handled after that. – thoroc Jun 14 '20 at 13:02
46

Use the wget module:

import wget
wget.download('url')
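
If you need a custom destination, the module also accepts an output path. A hedged sketch (assumes the out parameter of wget.download, which recent versions of the module provide; the URL is a placeholder):

import wget

url = "http://www.example.com/songs/mp3.mp3"
filename = wget.download(url, out="mp3.mp3")  # 'out' names the saved file (or a target directory)
print(filename)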
Sara Santana
44
import os,requests
def download(url):
    get_response = requests.get(url,stream=True)
    file_name  = url.split("/")[-1]
    with open(file_name, 'wb') as f:
        for chunk in get_response.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)


download("https://example.com/example.jpg")
H S Umer farooq
27

An improved version of the PabloG code for Python 2/3:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import ( division, absolute_import, print_function, unicode_literals )

import sys, os, tempfile, logging

if sys.version_info >= (3,):
    import urllib.request as urllib2
    import urllib.parse as urlparse
else:
    import urllib2
    import urlparse

def download_file(url, dest=None):
    """ 
    Download and save a file specified by url to dest directory,
    """
    u = urllib2.urlopen(url)

    scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
    filename = os.path.basename(path)
    if not filename:
        filename = 'downloaded.file'
    if dest:
        filename = os.path.join(dest, filename)

    with open(filename, 'wb') as f:
        meta = u.info()
        meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
        meta_length = meta_func("Content-Length")
        file_size = None
        if meta_length:
            file_size = int(meta_length[0])
        print("Downloading: {0} Bytes: {1}".format(url, file_size))

        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break

            file_size_dl += len(buffer)
            f.write(buffer)

            status = "{0:16}".format(file_size_dl)
            if file_size:
                status += "   [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
            status += chr(13)
            print(status, end="")
        print()

    return filename

if __name__ == "__main__":  # Only run if this file is called directly
    print("Testing with 10MB download")
    url = "http://download.thinkbroadband.com/10MB.zip"
    filename = download_file(url)
    print(filename)
Stan
23

A simple yet Python 2 & Python 3 compatible way comes with the six library:

from six.moves import urllib
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
Akif
21

The following are the most commonly used calls for downloading files in Python:

  1. urllib.urlretrieve ('url_to_file', file_name)

  2. urllib2.urlopen('url_to_file')

  3. requests.get(url)

  4. wget.download('url', file_name)

Note: urlopen and urlretrieve were found to perform relatively badly when downloading large files (> 500 MB). requests.get stores the file in memory until the download is complete, unless you pass stream=True and iterate over the response, as shown in the sketch below.
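
A minimal streaming sketch with requests that avoids holding the whole file in memory (the URL is a placeholder):

import requests

url = "http://www.example.com/songs/mp3.mp3"
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open("mp3.mp3", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)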

Jaydev
20

I wrote the wget library in pure Python just for this purpose. As of version 2.0, it is urlretrieve pumped up with these features.

anatoly techtonik
  • 3
    No option to save with custom filename ? – Alex May 21 '14 at 15:29
  • 2
    @Alex added -o FILENAME option to version 2.1 – anatoly techtonik Jul 10 '14 at 11:04
  • The progress bar does not appear when I use this module under Cygwin. – Joe Coder May 06 '15 at 07:40
  • You should change from `-o` to `-O` to avoid confusion, as it is in GNU wget. Or at least both options should be valid. – erik Jul 17 '15 at 15:46
  • @eric I am not sure that I want to make `wget.py` an in-place replacement for real `wget`. The `-o` already behaves differently - it is compatible with `curl` this way. Would a note in documentation help to resolve the issue? Or it is the essential feature for an utility with such name to be command line compatible? – anatoly techtonik Jul 17 '15 at 20:24
  • @anatolytechtonik How can i save the file downloaded using wget?. Also how to turn off verification of SSL certificates? – Gautam Krishna R Dec 24 '16 at 07:09
  • @GautamKrishnaR `wget.py` saves file automatically. As for SSL checks, they are done by Python, so look how to disable it there or ask another question. – anatoly techtonik Dec 25 '16 at 10:19
  • Is the library been supported in 2023? There is no homepage available on https://pypi.python.org/pypi/wget. – DarkSidds Jun 06 '23 at 09:44
17

In Python 3 you can use the urllib.request and shutil libraries. Both ship with the standard library, so there is nothing to install with pip.

Then run this code

import urllib.request
import shutil

url = "http://www.somewebsite.com/something.pdf"
output_file = "save_this_name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

Note that this uses the standard-library urllib.request module, not the third-party urllib3 package.

Apurv Agarwal
15

I agree with Corey: urllib2 is more complete than urllib and is likely the module to use if you want to do more complex things. But to make the answers more complete, urllib is a simpler module if you want just the basics:

import urllib
response = urllib.urlopen('http://www.example.com/sound.mp3')
mp3 = response.read()

Will work fine. Or, if you don't want to deal with the "response" object you can call read() directly:

import urllib
mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()
akdom
10

If you have wget installed, you can use parallel_sync.

pip install parallel_sync

from parallel_sync import wget
urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
wget.download('/tmp', urls)
# or a single file:
wget.download('/tmp', urls[0], filenames='x.zip', extract=True)

Doc: https://pythonhosted.org/parallel_sync/pages/examples.html

This is pretty powerful. It can download files in parallel, retry upon failure, and it can even download files on a remote machine.

max
9

You can get progress feedback with urlretrieve as well (Python 2):

import sys
import urllib

def report(blocknr, blocksize, size):
    # blocknr * blocksize is the number of bytes transferred so far
    current = blocknr * blocksize
    sys.stdout.write("\r{0:.2f}%".format(100.0 * current / size))

def downloadFile(url):
    print "\n", url
    fname = url.split('/')[-1]
    print fname
    urllib.urlretrieve(url, fname, report)
Marcin Cuprjak
6

Just for the sake of completeness, it is also possible to call any program for retrieving files using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-r), can deal with FTP, redirects, and HTTP proxies, and can avoid re-downloading existing files (-nc); aria2 can do multi-connection downloads, which can potentially speed up your downloads (see the sketch at the end of this answer).

import subprocess
subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])

In Jupyter Notebook, one can also call programs directly with the ! syntax:

!wget -O example_output_file.html https://example.com
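
aria2 can be driven the same way. A hedged sketch (assumes the aria2c binary is installed; -x sets the number of connections per server, -o the output file name):

import subprocess

subprocess.check_call([
    "aria2c",
    "-x", "8",                         # use up to 8 connections to the server
    "-o", "example_output_file.html",  # output file name
    "https://example.com",
])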
Robin Dinse
6

Use Python Requests in 5 lines

import requests as req

remote_url = 'http://www.example.com/sound.mp3'
local_file_name = 'sound.mp3'

data = req.get(remote_url)

# Save file data to local copy
with open(local_file_name, 'wb') as file:
    file.write(data.content)

Now do something with the local copy of the remote file

Thai Boy
5

I wrote the following, which works in vanilla Python 2 or Python 3.


import sys
try:
    import urllib.request
    python3 = True
except ImportError:
    import urllib2
    python3 = False


def progress_callback_simple(downloaded,total):
    sys.stdout.write(
        "\r" +
        (len(str(total))-len(str(downloaded)))*" " + str(downloaded) + "/%d"%total +
        " [%3.2f%%]"%(100.0*float(downloaded)/float(total))
    )
    sys.stdout.flush()

def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
    def _download_helper(response, out_file, file_size):
        if progress_callback!=None: progress_callback(0,file_size)
        if block_size == None:
            buffer = response.read()
            out_file.write(buffer)

            if progress_callback!=None: progress_callback(file_size,file_size)
        else:
            file_size_dl = 0
            while True:
                buffer = response.read(block_size)
                if not buffer: break

                file_size_dl += len(buffer)
                out_file.write(buffer)

                if progress_callback!=None: progress_callback(file_size_dl,file_size)
    with open(dstfilepath,"wb") as out_file:
        if python3:
            with urllib.request.urlopen(srcurl) as response:
                file_size = int(response.getheader("Content-Length"))
                _download_helper(response,out_file,file_size)
        else:
            response = urllib2.urlopen(srcurl)
            meta = response.info()
            file_size = int(meta.getheaders("Content-Length")[0])
            _download_helper(response,out_file,file_size)

import traceback
try:
    download(
        "https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
        "output.zip",
        progress_callback_simple
    )
except:
    traceback.print_exc()
    input()

Notes:

  • Supports a "progress bar" callback.
  • Download is a 4 MB test .zip from my website.
geometrian
5

If speed matters to you, I made a small performance test of the urllib and wget modules; regarding wget, I tried once with a status bar and once without. I took three different 500 MB files to test with (different files, to eliminate the chance that there is some caching going on under the hood). Tested on a Debian machine with Python 2.

First, these are the results (they are similar in different runs):

$ python wget_test.py 
urlretrive_test : starting
urlretrive_test : 6.56
==============
wget_no_bar_test : starting
wget_no_bar_test : 7.20
==============
wget_with_bar_test : starting
100% [......................................................................] 541335552 / 541335552
wget_with_bar_test : 50.49
==============

I performed the test using a "profile" decorator. This is the full code:

import wget
import urllib
import time
from functools import wraps

def profile(func):
    @wraps(func)
    def inner(*args):
        print func.__name__, ": starting"
        start = time.time()
        ret = func(*args)
        end = time.time()
        print func.__name__, ": {:.2f}".format(end - start)
        return ret
    return inner

url1 = 'http://host.com/500a.iso'
url2 = 'http://host.com/500b.iso'
url3 = 'http://host.com/500c.iso'

def do_nothing(*args):
    pass

@profile
def urlretrive_test(url):
    return urllib.urlretrieve(url)

@profile
def wget_no_bar_test(url):
    return wget.download(url, out='/tmp/', bar=do_nothing)

@profile
def wget_with_bar_test(url):
    return wget.download(url, out='/tmp/')

urlretrive_test(url1)
print '=============='
time.sleep(1)

wget_no_bar_test(url2)
print '=============='
time.sleep(1)

wget_with_bar_test(url3)
print '=============='
time.sleep(1)

urllib seems to be the fastest

Omer Dagan
5

Late answer, but for Python >= 3.6 you can use:

import dload
dload.save(url)

Install dload with:

pip3 install dload
Pedro Lobito
  • Can I ask - where does the file save once the program runs? Also, is there a way to name it and save it in a specific location? This is the link I am working with - when you click the link it immediately downloads an excel file: https://www.ons.gov.uk/generator?format=xls&uri=/economy/inflationandpriceindices/timeseries/chaw/mm23 – Joshua Tinashe Oct 14 '20 at 13:03
  • You can supply the save location as second argument, e.g.: `dload.save(url, "/home/user/test.xls")` – Pedro Lobito Oct 14 '20 at 15:42
4

With Python 2's urllib, the source code can be:

import urllib
sock = urllib.urlopen("http://diveintopython.org/")
htmlSource = sock.read()                            
sock.close()                                        
print htmlSource  
Zuko
4

You can use PycURL on Python 2 and 3.

import pycurl

FILE_DEST = 'pycurl.html'
FILE_SRC = 'http://pycurl.io/'

with open(FILE_DEST, 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, FILE_SRC)
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()
gzerone
4

You can use Python requests:

import os
import requests


outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(URL, stream=True)
with open(outfile,'wb') as output:
  output.write(response.content)

You can use shutil

import os
import requests
import shutil
 
outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(url, stream=True)
with open(outfile, 'wb') as f:
  shutil.copyfileobj(response.raw, f)  # response.raw is file-like; response.content would be raw bytes
  • If you are downloading from a restricted URL, don't forget to include the access token in the request headers, for example:
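
A sketch (the token value, header scheme, and URL are placeholders that depend on the API you are calling):

import requests

headers = {"Authorization": "Bearer <access-token>"}  # placeholder token
response = requests.get("https://example.com/protected/file.zip", headers=headers, stream=True)
response.raise_for_status()
with open("file.zip", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)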
Ankush Rathour
3

This may be a little late, but I saw PabloG's code and couldn't help adding an os.system('cls') to make it look AWESOME! Check it out:

    import urllib2,os

    url = "http://download.thinkbroadband.com/10MB.zip"

    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)
    os.system('cls')
    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,

    f.close()

If running in an environment other than Windows, you will have to use something other than 'cls'. On macOS and Linux it should be 'clear'.

JD3
  • 3
    `cls` doesn't do anything on my OS X or nor on an Ubuntu server of mine. Some clarification could be good. – the Sep 24 '14 at 21:57
  • I think you should use `clear` for linux, or even better replace the print line instead of clearing the whole command line output. – Arijoon Jan 21 '15 at 01:01
  • 5
    this answer just copies another answer and adds a call to a deprecated function (`os.system()`) that launches a subprocess to clear the screen using a platform specific command (`cls`). How does this have *any* upvotes?? Utterly worthless "answer" IMHO. – Corey Goldberg Dec 11 '15 at 19:56
3

urlretrieve and requests.get are simple, but in practice they are not always enough. I have fetched data from a couple of sites, including text and images, and the two above probably solve most of those tasks. But for a more universal solution I suggest using urlopen. As it is included in the Python 3 standard library, your code can run on any machine with Python 3 without pre-installing any site packages.

import urllib.request

url = "http://www.example.com/songs/mp3.mp3"   # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}        # custom headers, e.g. to work around HTTP 403
filename = "mp3.mp3"
buffer_size = 8192

url_request = urllib.request.Request(url, headers=headers)
url_connect = urllib.request.urlopen(url_request)

# remember to open the file in bytes mode
with open(filename, 'wb') as f:
    while True:
        buffer = url_connect.read(buffer_size)
        if not buffer:
            break

        # f.write returns the number of bytes written
        data_wrote = f.write(buffer)

# you could also wrap url_connect in a with statement instead
url_connect.close()

This approach provided a solution to HTTP 403 Forbidden errors when downloading files over HTTP with Python. I have only tried the requests and urllib modules; other modules may provide something better, but this is the one I used to solve most of the problems.

Sphynx-HenryAY
3

A newer-API, urllib3-based implementation:

>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'your_url_goes_here')
>>> r.status
   200
>>> r.data
   *****Response Data****

More info: https://pypi.org/project/urllib3/
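
To save a large response to disk without buffering it all in memory, urllib3 can also stream the body. A sketch (the URL is a placeholder):

import urllib3

http = urllib3.PoolManager()
r = http.request("GET", "http://www.example.com/songs/mp3.mp3", preload_content=False)
with open("mp3.mp3", "wb") as f:
    for chunk in r.stream(8192):  # read the body in 8 KiB chunks
        f.write(chunk)
r.release_conn()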

Ninja Master
0

I wanted to download all the files from a webpage. I tried wget but it was failing, so I decided on the Python route and found this thread.

After reading it, I have made a little command line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.

It uses BeautifulSoup to collect all the URLs of the page and then downloads the ones with the desired extension(s). Finally, it can download multiple files in parallel.

Here it is:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import (division, absolute_import, print_function, unicode_literals)
import sys, os, argparse
from bs4 import BeautifulSoup

# --- insert Stan's script here ---
# if sys.version_info >= (3,): 
#...
#...
# def download_file(url, dest=None): 
#...
#...

# --- new stuff ---
def collect_all_url(page_url, extensions):
    """
    Recovers all links in page_url checking for all the desired extensions
    """
    conn = urllib2.urlopen(page_url)
    html = conn.read()
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('a')

    results = []    
    for tag in links:
        link = tag.get('href', None)
        if link is not None: 
            for e in extensions:
                if e in link:
                    # Fallback for badly defined links
                    # checks for missing scheme or netloc
                    if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
                        results.append(link)
                    else:
                        new_url=urlparse.urljoin(page_url,link)                        
                        results.append(new_url)
    return results

if __name__ == "__main__":  # Only run if this file is called directly
    # Command line arguments
    parser = argparse.ArgumentParser(
        description='Download all files from a webpage.')
    parser.add_argument(
        '-u', '--url', 
        help='Page url to request')
    parser.add_argument(
        '-e', '--ext', 
        nargs='+',
        help='Extension(s) to find')    
    parser.add_argument(
        '-d', '--dest', 
        default=None,
        help='Destination where to save the files')
    parser.add_argument(
        '-p', '--par', 
        action='store_true', default=False, 
        help="Turns on parallel download")
    args = parser.parse_args()

    # Recover files to download
    all_links = collect_all_url(args.url, args.ext)

    # Download
    if not args.par:
        for l in all_links:
            try:
                filename = download_file(l, args.dest)
                print(l)
            except Exception as e:
                print("Error while downloading: {}".format(e))
    else:
        from multiprocessing.pool import ThreadPool
        results = ThreadPool(10).imap_unordered(
            lambda x: download_file(x, args.dest), all_links)
        for p in results:
            print(p)

An example of its usage is:

python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>

And an actual example if you want to see it in action:

python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics
gibbone
0

Another possibility is the built-in http.client:

from http import HTTPStatus, client
from shutil import copyfileobj

# using https
connection = client.HTTPSConnection("www.example.com")
connection.request("GET", "/noise.mp3")
response = connection.getresponse()
if response.status == HTTPStatus.OK:
    with open("noise.mp3", "wb") as out_file:
        copyfileobj(response, out_file)  # the response object is file-like
else:
    raise Exception("request needs work")

The HTTPConnection object is considered “low-level” in that it performs the desired request once and assumes the developer will subclass it or script in a way to handle the nuances of HTTP. Libraries such as requests tend to handle more special cases such as automatically following redirects and so on.

thebadgateway
0

You can use keras.utils.get_file to do it:

from tensorflow import keras

path_to_downloaded_file = keras.utils.get_file(
    fname="file name",
    origin="https://www.linktofile.com/link/to/file",
    extract=True,
    archive_format="zip",  # downloaded file format
    cache_dir="/",  # cache and extract in current directory
)
MD Mushfirat Mohaimin
-3

Another way is to call an external process such as curl.exe. By default, curl displays a progress bar, average download speed, time left, and more, all formatted neatly in a table. Put curl.exe in the same directory as your script.

from subprocess import call

url = ""
call(["curl", url, "--output", "song.mp3"])

Note: --output also accepts a full path, so you can write directly to the desired location instead of doing an os.rename afterwards.

firebfm