114

I'm trying to create a Python function that does the same thing as this wget command:

wget -c --read-timeout=5 --tries=0 "$URL"

-c - Continue from where you left off if the download is interrupted.

--read-timeout=5 - If there is no new data coming in for over 5 seconds, give up and try again. Given -c, this means it will try again from where it left off.

--tries=0 - Retry forever.

Those three arguments used in tandem result in a download that cannot fail.

I want to duplicate those features in my Python script, but I don't know where to begin...
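
Roughly, the behavior I'm after would look something like this sketch (assuming the third-party requests library; the function name and placeholders are illustrative):

import os
import requests  # third-party: pip install requests

def wget_like(url, filename, read_timeout=5):
    """A rough sketch of `wget -c --read-timeout=5 --tries=0 URL`."""
    while True:  # --tries=0: retry forever
        # -c: resume from however many bytes are already on disk
        resume_from = os.path.getsize(filename) if os.path.exists(filename) else 0
        # Assumes the server honors Range requests (responds with 206)
        headers = {'Range': 'bytes=%d-' % resume_from} if resume_from else {}
        try:
            # timeout covers the connect and gaps between reads,
            # which roughly matches --read-timeout
            with requests.get(url, headers=headers, stream=True, timeout=read_timeout) as r:
                if r.status_code == 416:  # range not satisfiable: already complete
                    return
                r.raise_for_status()
                with open(filename, 'ab') as f:
                    for chunk in r.iter_content(chunk_size=8192):
                        f.write(chunk)
            return  # finished without stalling
        except requests.exceptions.RequestException:
            continue  # stalled or dropped; loop around and resume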

Soviero
  • Well, no, the download *can* fail for many reasons but yeah. Have you looked into the [requests](http://docs.python-requests.org/en/latest/) module? – Iguananaut Jun 21 '14 at 23:50
  • @Iguananaut It should be noted that downloads can be interrupted with Ctrl+C on purpose, with the command-line wget tool, anyway (I believe this is the standard way to pause them in wget, using `wget -c the_URL` to resume). See https://ubuntuforums.org/showthread.php?t=991864 – Brōtsyorfuzthrāx Sep 16 '18 at 04:38

10 Answers

149

There is also a nice Python module named wget that is pretty easy to use. Keep in mind that the package has not been updated since 2015 and has not implemented a number of important features, so it may be better to use other methods. It depends entirely on your use case. For simple downloading, this module is the ticket. If you need to do more, there are other solutions out there.

>>> import wget
>>> url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
>>> filename = wget.download(url)
100% [................................................] 3841532 / 3841532
>>> filename
'razorback.mp3'

Enjoy.

However, if wget doesn't work (I've had trouble with certain PDF files), try this solution.

Edit: You can also use the out parameter to specify a custom output directory instead of the current working directory.

>>> output_directory = <directory_name>
>>> filename = wget.download(url, out=output_directory)
>>> filename
'razorback.mp3'
Blairg23
  • yes I have posted here https://stackoverflow.com/questions/45747913/facing-issue-with-wget-in-python?noredirect=1#comment78456827_45747913 – Ashish Karpe Aug 18 '17 at 11:35
  • Sorry for the late reply, didn't see this notification for some reason. You need to `pip install wget` most likely. – Blairg23 Mar 11 '18 at 20:13
  • @AshishKarpe If you're on Ubuntu, try sudo apt-get install python3-wget. – Brōtsyorfuzthrāx Sep 16 '18 at 04:32
  • @Blairg23, the question asks how to continue a download (not how to start one, as shown in your answer). The Python wget documentation doesn't seem to say anything about continuing downloads. If you know something about it, please say (maybe it's automatic in the module). – Brōtsyorfuzthrāx Sep 16 '18 at 04:32
  • @Shule That's a really good point that I hadn't even noticed until you brought it up. I haven't played with the continue parameter at all with this `wget` Python module, but here is the source if you want to check it out: https://bitbucket.org/techtonik/python-wget – Blairg23 Sep 16 '18 at 12:28
  • `wget` comes with very few options and doesn't seem to be maintained. `requests` is superior in every way. – imrek Sep 05 '19 at 05:13
  • @DrunkenMaster This is the only use case where I would argue with you. `wget` offers a very simple function: `wget`. It offers a nice interface that acts exactly like the Linux command and does that one thing very well. It probably doesn't require much maintenance since it's such a simple function. And if you're worried about it, you can simply fork it :) – Blairg23 Sep 06 '19 at 07:20
  • @Blairg23 Meanwhile the python wget package explicitly says it's not option-compatible with the original `wget` utility. FYI, you can't even set the User-Agent header, can you? – imrek Sep 06 '19 at 10:38
  • For anyone hoping to use it as a replacement for wget with all its options: it won't work. The only option supported is `out`, and it's also abandoned. https://stackoverflow.com/a/51812486/8608146 is a better answer than this. – Phani Rithvij Sep 06 '19 at 16:04
  • Agree to disagree. The Zen of Python would state `simple > complex`, and this fits well within that. Just 3 lines of code and an import. – Blairg23 Sep 06 '19 at 16:46
  • Sorry for the laziness - it's probably written somewhere - but is the wget.download command blocking? i.e., would wget.download be downloading in parallel, or should I do something extra for that? – Ori5678 Mar 01 '22 at 11:04
42

urllib.request should work. Just set it up in a `while not done` loop: check whether a local file already exists, and if it does, send a GET with a Range header specifying how far you got in downloading it. Be sure to use read() to append to the local file until an error occurs, as in the sketch below.
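
A minimal sketch of that loop (my illustration, not the answerer's code; the URL and file name are placeholders, and the server is assumed to support Range requests):

import os
import urllib.error
import urllib.request

url = 'http://example.com/big.file'  # placeholder
localfile = 'big.file'               # placeholder

done = False
while not done:
    # Resume from wherever the partial local file currently ends
    start = os.path.getsize(localfile) if os.path.exists(localfile) else 0
    req = urllib.request.Request(url, headers={'Range': 'bytes=%d-' % start})
    try:
        with urllib.request.urlopen(req, timeout=5) as response, open(localfile, 'ab') as f:
            while True:
                chunk = response.read(8192)
                if not chunk:        # server finished sending
                    done = True
                    break
                f.write(chunk)
    except urllib.error.HTTPError as e:
        if e.code == 416:            # range not satisfiable: file already complete
            done = True
    except OSError:
        pass                         # timeout or connection error; retry and resume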

This is also potentially a duplicate of Python urllib2 resume download doesn't work when network reconnects

Eugene K
  • When I try `urllib.request.urlopen` or `urllib.request.Request` with a string containing the url as the url argument, I get `ValueError: unknown url type` – Ecko Nov 18 '15 at 16:55
  • @XamuelDvorak Are you actually entering a URL? A URL requires the type, e.g. `http://`, `ftp://`. – Eugene K Dec 01 '15 at 22:48
  • I was using 'stackoverflow.com', which, in my browser, has nothing of that sort in front of it. – Ecko Dec 13 '15 at 19:55
  • It shows that for other websites, though. I'll try your solution. – Ecko Dec 14 '15 at 21:52
31

I had to do something like this on a version of Linux that didn't have the right options compiled into wget. This example is for downloading the memory analysis tool 'guppy'. I'm not sure if it's important or not, but I kept the target file's name the same as the URL target name...

Here's what I came up with:

python -c "import requests; r = requests.get('https://pypi.python.org/packages/source/g/guppy/guppy-0.1.10.tar.gz'); open('guppy-0.1.10.tar.gz', 'wb').write(r.content)"

That's the one-liner; here it is in a slightly more readable form:

import requests
fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname
r = requests.get(url)
open(fname, 'wb').write(r.content)

This worked for downloading a tarball. I was able to extract and use the package after downloading it.

EDIT:

To address a question, here is an implementation with a progress bar printed to STDOUT. There is probably a more portable way to do this without the clint package, but this was tested on my machine and works fine:

#!/usr/bin/env python

from clint.textui import progress
import requests

fname = 'guppy-0.1.10.tar.gz'
url = 'https://pypi.python.org/packages/source/g/guppy/' + fname

r = requests.get(url, stream=True)
with open(fname, 'wb') as f:
    total_length = int(r.headers.get('content-length'))
    for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length/1024) + 1): 
        if chunk:
            f.write(chunk)
            f.flush()
Will Charlton
  • `python -m ensurepip && python -m pip install --upgrade pip && python -c "import os, sys, pip._vendor.requests; rq=pip._vendor.requests.get(sys.argv[1], stream=True); open(os.path.basename(sys.argv[1]), 'wb').write(rq.content)" http://httpbin.org/encoding/utf8 && diff -u ./utf8 <(curl -fSsl http://httpbin.org/encoding/utf8); cat ./utf8` # Python > ensurepip docs: https://docs.python.org/3/library/ensurepip.html#command-line-interface # Pip docs > Vendoring policy: https://pip.pypa.io/en/stable/development/vendoring-policy/ – Wes Turner Jan 26 '23 at 02:47
  • Though does `os.basename` fail with forward-slash URL ~paths on windows? `test "brotli" == "$(python -c "import sys, urllib.parse; print(urllib.parse.urlparse(sys.argv[1]).path.rsplit('/',1)[-1])" "https://httpbin.org/path/to/brotli#fragment")"` – Wes Turner Jan 26 '23 at 03:01
31

A solution that I often find simpler and more robust is to simply execute a terminal command within Python. In your case:

import os
url = 'https://www.someurl.com'
os.system(f'wget -c --read-timeout=5 --tries=0 "{url}"')
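
If you go this route, `subprocess` with an argument list is safer than `os.system`, since the URL never passes through a shell (a sketch; see also the comments below):

import subprocess

url = 'https://www.someurl.com'
# No shell involved, so no quoting or injection worries with the URL;
# check=True raises CalledProcessError if wget exits nonzero
subprocess.run(["wget", "-c", "--read-timeout=5", "--tries=0", url], check=True)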
Yohan Obadia
  • When I get a downvote, especially for providing a completely different approach, I like to know why. Care to explain? – Yohan Obadia May 15 '19 at 14:29
  • It looks like the args for os.system are improperly escaped. One " too many at the end. Additionally, it doesn't work on Windows because it has no wget. For that you need to go here: https://eternallybored.org/misc/wget/ download it and add it to the environment (PATH). Good solution though, upvoting ;) – Abel Dantas Jun 10 '19 at 17:07
  • Thanks for your feedback :) – Yohan Obadia Jun 12 '19 at 14:31
  • Use `subprocess`. **ALWAYS** use `subprocess`. Trivially easy to pwn a machine that uses `os.system` like this with remote user input. – Antti Haapala -- Слава Україні Jul 28 '20 at 06:10
  • `subprocess` without `shell=True` (or something that wraps subprocess like sarge, which does quoting to prevent shell-escape vulnerabilities) – Wes Turner Jan 26 '23 at 02:51
20
import urllib2
import time

max_attempts = 80
attempts = 0
sleeptime = 10  # in seconds; no reason to continuously try if the network is down

# while True:  # possibly dangerous: retries forever
while attempts < max_attempts:
    time.sleep(sleeptime)
    try:
        response = urllib2.urlopen("http://example.com", timeout = 5)
        content = response.read()
        f = open( "local/index.html", 'w' )
        f.write( content )
        f.close()
        break
    except urllib2.URLError as e:
        attempts += 1
        print type(e)
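
For Python 3, the same retry loop looks roughly like this (a sketch; urllib2 was split into urllib.request and urllib.error):

import time
import urllib.error
import urllib.request

max_attempts = 80
sleeptime = 10  # seconds; no reason to hammer the network if it is down

for attempt in range(max_attempts):
    try:
        response = urllib.request.urlopen("http://example.com", timeout=5)
        content = response.read()
        with open("local/index.html", "wb") as f:  # assumes the local/ directory exists
            f.write(content)
        break
    except urllib.error.URLError as e:
        print(type(e))
        time.sleep(sleeptime)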
Pujan
15

For Windows and Python 3.x, my two cents about renaming the file on download:

  1. Install the wget module: pip install wget
  2. Use wget:
import wget
wget.download('Url', 'C:\\PathToMyDownloadFolder\\NewFileName.extension')

A genuinely working command-line example:

python -c "import wget; wget.download(""https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz"", ""C:\\Users\\TestName.TestExtension"")"

Note: 'C:\\PathToMyDownloadFolder\\NewFileName.extension' is not mandatory. By default, the file is not renamed, and the download folder is your current working directory.

Paul Denoyes
  • Just my 2 cents: for users on Windows with Anaconda, it looks like there is no wget package. In this case, I would suggest using the requests lib, which is more portable. Have a look at this [link](https://anaconda.org/search?q=wget). There are a few packages for win-64 but none from *official* sources. – toto Jun 29 '22 at 04:08
7

Here's code adapted from the torchvision library:

import os
import urllib.error
import urllib.request

def download_url(url, root, filename=None):
    """Download a file from a url and place it in root.
    Args:
        url (str): URL to download file from
        root (str): Directory to place downloaded file in
        filename (str, optional): Name to save the file under. If None, use the basename of the URL
    """

    root = os.path.expanduser(root)
    if not filename:
        filename = os.path.basename(url)
    fpath = os.path.join(root, filename)

    os.makedirs(root, exist_ok=True)

    try:
        print('Downloading ' + url + ' to ' + fpath)
        urllib.request.urlretrieve(url, fpath)
    except (urllib.error.URLError, IOError) as e:
        if url[:5] == 'https':
            url = url.replace('https:', 'http:')
            print('Failed download. Trying https -> http instead.'
                    ' Downloading ' + url + ' to ' + fpath)
            urllib.request.urlretrieve(url, fpath)
        else:
            raise e

If you are OK taking a dependency on the torchvision library, then you can also simply do:

from torchvision.datasets.utils import download_url
download_url('http://something.com/file.zip', '~/my_folder')
Shital Shah
1

Let me improve the example with threads, in case you want to download many files.

import math
import random
import threading

import requests
from clint.textui import progress

# You must define a proxy list
# I suggest https://free-proxy-list.net/
proxies = {
    0: {'http': 'http://34.208.47.183:80'},
    1: {'http': 'http://40.69.191.149:3128'},
    2: {'http': 'http://104.154.205.214:1080'},
    3: {'http': 'http://52.11.190.64:3128'}
}


# You must define the list of files you want to download
videos = [
    "https://i.stack.imgur.com/g2BHi.jpg",
    "https://i.stack.imgur.com/NURaP.jpg"
]

downloaderses = list()


def downloaders(video, selected_proxy):
    print("Downloading file named {} by proxy {}...".format(video, selected_proxy))
    r = requests.get(video, stream=True, proxies=selected_proxy)
    nombre_video = video.split("/")[3]
    with open(nombre_video, 'wb') as f:
        total_length = int(r.headers.get('content-length'))
        for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length / 1024) + 1):
            if chunk:
                f.write(chunk)
                f.flush()


for video in videos:
    selected_proxy = proxies[math.floor(random.random() * len(proxies))]
    t = threading.Thread(target=downloaders, args=(video, selected_proxy))
    downloaderses.append(t)

for _downloaders in downloaderses:
    _downloaders.start()
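
One note: the snippet starts the threads but never joins them; if you need to block until every download finishes, you can add (my addition, not in the original answer):

# Wait for all download threads to complete
for _downloaders in downloaderses:
    _downloaders.join()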
Egalicia
  • This does none of the things OP asked for (and several things they didn't ask for). – melpomene Jul 19 '17 at 06:23
  • The example tries to show wget's multi-download feature. – Egalicia Jul 19 '17 at 06:31
  • No one asked for that. OP asked for the equivalent of `-c`, `--read-timeout=5`, and `--tries=0` (with a single URL). – melpomene Jul 19 '17 at 06:33
  • I understand, sorry :( – Egalicia Jul 19 '17 at 06:34
  • I'm really glad to see it here, serendipity being the cornerstone of the internet. I might add whilst here, though, that during my research I came across this for multithreading and the requests library: requests-threads https://github.com/requests/requests-threads – miller the gorilla Nov 03 '18 at 07:55
  • Correction - https://pypi.org/project/txrequests/ txrequests seems to be the best solution, albeit for a different problem from the one posed by this thread. – miller the gorilla Nov 03 '18 at 08:15
1

easy as py:

class Downloder():
    def download_manager(self, url, destination='Files/DownloderApp/', try_number="10", time_out="60"):
        #threading.Thread(target=self._wget_dl, args=(url, destination, try_number, time_out)).start()
        if self._wget_dl(url, destination, try_number, time_out) == 0:
            return True
        else:
            return False

    def _wget_dl(self, url, destination, try_number, time_out):
        import subprocess
        command = ["wget", "-c", "-P", destination, "-t", try_number, "-T", time_out, url]
        try:
            download_state = subprocess.call(command)
        except Exception as e:
            print(e)
            return -1  # wget missing or could not be launched
        # download_state == 0 means a successful download
        return download_state
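
A hypothetical usage sketch (assumes the wget binary is on your PATH; the URL is a placeholder):

downloader = Downloder()
ok = downloader.download_manager("http://example.com/big.file")
print("Downloaded" if ok else "Failed")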
pd shah
-6

TensorFlow makes life easier. The returned file path gives us the location of the downloaded file.

import tensorflow as tf

file_path = tf.keras.utils.get_file(origin='https://storage.googleapis.com/tf-datasets/titanic/train.csv',
                                    fname='train.csv',
                                    untar=False, extract=False)
Rajan saha Raju