
urllib.urlretrieve returns silently even if the file doesn't exist on the remote HTTP server; it just saves an HTML page to the named file. For example:

urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')

just returns silently, even if abc.jpg doesn't exist on the google.com server; the generated abc.jpg is not a valid JPEG file, it's actually an HTML page. I guess the returned headers (an httplib.HTTPMessage instance) can be used to tell whether the retrieval succeeded or not, but I can't find any documentation for httplib.HTTPMessage.
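
Something along these lines is what I have in mind, checking the Content-Type of whatever came back (just a guess on my part, since the class is undocumented):

import urllib

filename, headers = urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')
# headers is an httplib.HTTPMessage; if the server sent an HTML error
# page instead of the image, the Content-Type should give it away.
if headers.gettype() != 'image/jpeg':
    print 'probably not a JPEG: got %s' % headers.gettype()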

Can anybody provide some information about this problem?

btw0

8 Answers

27

Consider using urllib2 if it's possible in your case. It is more advanced and easier to use than urllib.

You can detect any HTTP errors easily:

>>> import urllib2
>>> resp = urllib2.urlopen("http://google.com/abc.jpg")
Traceback (most recent call last):
<<MANY LINES SKIPPED>>
urllib2.HTTPError: HTTP Error 404: Not Found

resp is actually an HTTPResponse object that you can do a lot of useful things with:

>>> resp = urllib2.urlopen("http://google.com/")
>>> resp.code
200
>>> resp.headers["content-type"]
'text/html; charset=windows-1251'
>>> resp.read()
"<<ACTUAL HTML>>"
Alexander Lebedev
    Can urllib2 provide the caching behavior of urlretrieve though? Or would we have to reimplement it? – Kiv Jun 12 '09 at 22:45
  • See this awesome recipe from ActiveState: http://code.activestate.com/recipes/491261/ We're using it in our current project; it works flawlessly – Alexander Lebedev Jun 18 '09 at 05:50
    urlopen does not provide a hook function (to show progress bar for example) like urlretrieve. – Sridhar Ratnakumar Aug 20 '09 at 20:05
    You can hook your own function: fp = open(local, 'wb') totalSize = int(h["Content-Length"]) blockSize = 8192 # same value as in urllib.urlretrieve count = 0 while True: chunk = resp.read(blockSize) if not chunk: break fp.write(chunk) count += 1 dlProgress(count, blockSize, totalSize) # The hook! fp.flush() fp.close() – Cees Timmerman Mar 16 '12 at 15:49
7

I keep it simple:

# Simple downloading with progress indicator, by Cees Timmerman, 16mar12.

import urllib2

remote = r"http://some.big.file"
local = r"c:\downloads\bigfile.dat"

u = urllib2.urlopen(remote)
h = u.info()
totalSize = int(h["Content-Length"])

print "Downloading %s bytes..." % totalSize,
fp = open(local, 'wb')

blockSize = 8192  # the same value urllib.urlretrieve uses
count = 0
while True:
    chunk = u.read(blockSize)
    if not chunk: break
    fp.write(chunk)
    count += 1
    if totalSize > 0:
        percent = int(count * blockSize * 100 / totalSize)
        if percent > 100: percent = 100
        print "%2d%%" % percent,
        if percent < 100:
            print "\b\b\b\b\b",  # Erase "NN% "
        else:
            print "Done."

fp.flush()
fp.close()
if not totalSize:
    print
Cees Timmerman
5

According to the documentation, it is undocumented.

To get access to the message, it looks like you do something like:

a, b=urllib.urlretrieve('http://google.com/abc.jpg', r'c:\abc.jpg')

b is the message instance.

Ever since I started learning Python, I have found it useful to use Python's ability to be introspective. So when I type

dir(b) 

I see lots of methods and functions to play with.

And then I started doing things with b. For example,

b.items()

That lists lots of interesting things. I suspect that playing around with them will let you find the attribute you want to manipulate.

Sorry this is such a beginner's answer, but I am trying to master Python's introspection abilities to improve my learning, and your question just popped up.

Well, I tried something interesting related to this: I wondered if I could automatically get the output from each of the things that showed up in the directory that did not need parameters, so I wrote:

needparam=[]
for each in dir(b):
    x='b.'+each+'()'
    try:
        eval(x)
        print x
    except:
        needparam.append(x)
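
A variant of the same experiment using getattr instead of building strings for eval (my sketch; it still calls methods blindly, so expect some noise):

# Call each zero-argument method on b; collect the ones that need arguments.
needparam = []
for name in dir(b):
    attr = getattr(b, name)
    if not callable(attr):
        continue
    try:
        print name, attr()
    except Exception:
        needparam.append(name)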
PyNEwbie
2

Unfortunately, FancyURLopener (which urlretrieve uses under the hood) ignores 404 and other errors, but you can create a new URLopener (inheriting from FancyURLopener) and raise exceptions or handle errors any way you want. See this question:

How to catch 404 error in urllib.urlretrieve
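
A minimal sketch of that subclassing approach (RaisingOpener is just an illustrative name):

import urllib

class RaisingOpener(urllib.FancyURLopener):
    # FancyURLopener normally swallows 404s and saves the error page;
    # overriding the default handler makes HTTP errors raise instead.
    def http_error_default(self, url, fp, errcode, errmsg, headers):
        fp.close()
        raise IOError("HTTP error %d: %s" % (errcode, errmsg))

try:
    RaisingOpener().retrieve('http://google.com/abc.jpg', 'abc.jpg')
except IOError, e:
    print e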

Christian Davén
1

I ended up writing my own retrieve implementation. With the help of pycurl it supports more protocols than urllib/urllib2; I hope it can help other people.

import tempfile
import pycurl
import os

def get_filename_parts_from_url(url):
    fullname = url.split('/')[-1].split('#')[0].split('?')[0]
    t = list(os.path.splitext(fullname))
    if t[1]:
        t[1] = t[1][1:]
    return t

def retrieve(url, filename=None):
    if not filename:
        garbage, suffix = get_filename_parts_from_url(url)
        f = tempfile.NamedTemporaryFile(suffix='.' + suffix, delete=False)
        filename = f.name
    else:
        f = open(filename, 'wb')
    c = pycurl.Curl()
    c.setopt(pycurl.URL, str(url))
    c.setopt(pycurl.WRITEFUNCTION, f.write)
    try:
        c.perform()
    except pycurl.error:
        # The download failed; signal it by returning None.
        filename = None
    finally:
        c.close()
        f.close()
    return filename
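
For example (with a made-up URL):

saved = retrieve('http://example.com/pic.jpg')
if saved is None:
    print 'download failed'
else:
    print 'saved to', saved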
btw0
0
import urllib

class MyURLopener(urllib.FancyURLopener):
    # Use the strict base-class handler, which raises IOError on
    # HTTP errors instead of quietly saving the error page.
    http_error_default = urllib.URLopener.http_error_default

url = "http://page404.com"
filename = "download.txt"

def reporthook(blockcount, blocksize, totalsize):
    pass
    # ...

try:
    (f, headers) = MyURLopener().retrieve(url, filename, reporthook)
except Exception, e:
    print e
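
Assigning urllib.URLopener.http_error_default back into the subclass restores the strict base-class behavior, so HTTP errors raise IOError instead of being handled "fancily" (i.e. silently saving the error page as if nothing went wrong).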
0

:) My first post on StackOverflow; I have been a lurker for years. :)

Sadly, dir(urllib.urlretrieve) is deficient in useful information, so based on this thread so far I tried writing this:

a,b = urllib.urlretrieve(imgURL, saveTo)
print "A:", a
print "B:", b

which produced this:

A: /home/myuser/targetfile.gif
B: Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Cache-Control: max-age=604800
Content-Type: image/gif
Date: Mon, 07 Mar 2016 23:37:34 GMT
Etag: "4e1a5d9cc0857184df682518b9b0da33"
Last-Modified: Sun, 06 Mar 2016 21:16:48 GMT
Server: ECS (hnd/057A)
Timing-Allow-Origin: *
X-Cache: HIT
Content-Length: 27027
Connection: close

I guess one can check (b supports dictionary-style access to the headers):

if int(b["Content-Length"]) > 0:

My next step is to test a scenario where the retrieve fails...

fotonix
0

Results against another server/website: what comes back in "B" varies a bit, but one can test for certain values:

A: get_good.jpg
B: Date: Tue, 08 Mar 2016 00:44:19 GMT
Server: Apache
Last-Modified: Sat, 02 Jan 2016 09:17:21 GMT
ETag: "524cf9-18afe-528565aef9ef0"
Accept-Ranges: bytes
Content-Length: 101118
Connection: close
Content-Type: image/jpeg

A: get_bad.jpg
B: Date: Tue, 08 Mar 2016 00:44:20 GMT
Server: Apache
Content-Length: 1363
X-Frame-Options: deny
Connection: close
Content-Type: text/html

In the 'bad' case (non-existent image file), "B" retrieved a small chunk of (Googlebot?) HTML code and saved it as the target, hence the Content-Length of 1363 bytes.
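
So a practical check based on these headers might be (a sketch; the second value urlretrieve returns supports the mimetools.Message accessors):

import urllib

fname, headers = urllib.urlretrieve(imgURL, saveTo)
if headers.getmaintype() != 'image':
    print "got %s instead of an image" % headers.gettype()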

fotonix