Download files given their url and store the filename same as in content disposition

Question

I got this code from "How to download a file using python in a 'smarter' way??"

But it throws an error:

   in download
   r.close()
   UnboundLocalError: local variable 'r' referenced before assignment

Also, I would like to add a condition that only PDF files are downloaded.

import urllib2
import shutil
import urlparse
import os


def download(url, fileName=None):
    def getFileName(url,openUrl):
        if 'Content-Disposition' in openUrl.info():
            # If the response has Content-Disposition, try to get filename from it
            cd = dict(map(lambda x: x.strip().split('=') if '=' in x else (x.strip(),''),openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename: return filename
        # if no filename was found above, parse it out of the final URL.
        return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

    req = urllib2.Request(url)
    try:
        r = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
            print e.fp.read()
    try:
            fileName = fileName or getFileName(url,r)
            with open(fileName, 'wb') as f:
                 shutil.copyfileobj(r,f)
    finally:
            r.close()

download('http://www.altria.com/Documents/Altria_10Q_Filed10242013.pdf#?page=24')

This works completely fine with the URL http://www.gao.gov/new.items/d04641.pdf. So my question is: why does it fail for some URLs but work completely fine for URLs like the one above?

You have described your problem, and you have included a sample program. That's good. You are still missing the key ingredient of a SO post: a question. SO is a question-and-answer site. Readers such as yourself ask questions and other readers attempt to answer them. What is your question? — Robᵩ, Jan 10 '14 at 20:21

score 0 · Answer 1 · answered Jan 10 '14 at 20:29

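This is a scope issue: if urllib2.urlopen(req) raises, r is never assigned, so the r.close() call in the finally block fails.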

At the beginning of your function, define:

    r=None

Then, instead of calling r.close(), do the following:

    if r:
        r.close()
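
A minimal sketch of the guard described above, ported to Python 3 (an assumption; the question uses Python 2's urllib2). The `opener` parameter and the boolean return value are hypothetical additions so that the failure path can be exercised without a network; they are not part of the original code. The two try blocks are also merged into one here, so a failed request is reported through the return value rather than passing silently.

```python
import shutil
import urllib.error
import urllib.request


def download(url, file_name, opener=urllib.request.urlopen):
    """Return True if url was saved to file_name, False on an HTTP error."""
    r = None  # bound before the try, so the finally block can always test it
    try:
        r = opener(url)
        with open(file_name, "wb") as f:
            shutil.copyfileobj(r, f)
        return True
    except urllib.error.HTTPError as e:
        # If the opener raised, r is still None at this point.
        print("download failed: HTTP %s %s" % (e.code, e.reason))
        return False
    finally:
        if r is not None:
            r.close()
```

Calling `download(url, "out.pdf")` with the default opener performs a real request; passing a stub as `opener` makes the function testable offline.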

answered Jan 10 '14 at 20:29

P B


Can whoever down-voted please comment? The original code is attempting to call close() on r in the "finally" block even though it may not have been initialized. What is wrong with making sure it has been? The original poster was attempting to handle errors gracefully and this solves that issue if he/she attempts to access an inaccessible URL. – P B Jan 10 '14 at 22:34
I didn't downvote it, but what you suggested will end with silent failure. The question at the end wasn't about how to hush the error, it was about why the error was happening. That said, your answer certainly wasn't worth a down vote. Have some rep. – nmichaels Jan 13 '14 at 16:15

score 0 · Answer 2 · answered Jan 10 '14 at 20:31

What's happening is that the first exception is caught by except urllib2.HTTPError, but the code then continues even though r was never assigned (because urlopen raised before the assignment could complete).

I think you want to use the else clause in your try/except block to only execute the rest of the code if r = urllib2.urlopen(req) succeeded:

def download(url, fileName=None):
    def getFileName(url,openUrl):
        if 'Content-Disposition' in openUrl.info():
            # If the response has Content-Disposition, try to get filename from it
            cd = dict(map(lambda x: x.strip().split('=') if '=' in x else (x.strip(),''),openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename: return filename
        # if no filename was found above, parse it out of the final URL.
        return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

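    req = urllib2.Request(url)
    try:
        r = urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        # The request failed, so r was never assigned; report it and stop here.
        print e.fp.read()
    else:
        # Runs only if urlopen succeeded, so r is guaranteed to exist.
        try:
            fileName = fileName or getFileName(url, r)
            with open(fileName, 'wb') as f:
                shutil.copyfileobj(r, f)
        finally:
            r.close()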
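
The filename lookup in getFileName, plus the PDF-only condition asked for in the question, can be sketched as follows. This assumes Python 3, where the response headers are an `email.message.Message` subclass and so offer `get_filename()` and `get_content_type()`; the Python 2 code above parses the header by hand instead. The helper names `pick_filename` and `looks_like_pdf` are hypothetical, and the PDF test is only a heuristic based on what the server declares.

```python
import os
from email.message import Message
from urllib.parse import urlsplit


def pick_filename(headers, final_url):
    """Prefer the Content-Disposition filename, else the last URL path segment."""
    name = headers.get_filename()
    if name:
        # Drop any directory part a server might include in the header.
        return os.path.basename(name)
    # urlsplit keeps the query and fragment out of .path.
    return os.path.basename(urlsplit(final_url).path)


def looks_like_pdf(headers, filename):
    """Heuristic: a declared application/pdf type or a .pdf extension."""
    return (headers.get_content_type() == "application/pdf"
            or filename.lower().endswith(".pdf"))


# Stand-in for response.headers, built by hand so the sketch runs offline.
demo = Message()
demo["Content-Disposition"] = 'attachment; filename="report.pdf"'
demo["Content-Type"] = "application/pdf"
print(pick_filename(demo, "http://example.com/files/download?id=1"))
```

In a real download, one would call `pick_filename(r.headers, r.url)` on the opened response and skip the copy when `looks_like_pdf` returns False.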

score -1 · Answer 3 · answered Jan 10 '14 at 20:24

I assume it prints an error message about urllib2.urlopen(req) failing before it gives you that UnboundLocalError. If it does, add raise on the line after print e.fp.read(), and the traceback will show the original HTTPError instead.
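
A small sketch of that suggestion, written for Python 3 (an assumption; the question uses Python 2). The function name `open_url` and the `opener` parameter are hypothetical and exist only so the error path can be tried offline. The bare `raise` re-raises the HTTPError that was just caught, so the caller sees the real cause rather than a later UnboundLocalError.

```python
import urllib.error
import urllib.request


def open_url(url, opener=urllib.request.urlopen):
    """Open url; on an HTTP error, report it and let it propagate."""
    try:
        return opener(url)
    except urllib.error.HTTPError as e:
        print("HTTP Error %s: %s" % (e.code, e.reason))
        raise  # re-raise the same exception with its original traceback
```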

edited Jan 13 '14 at 16:16

answered Jan 10 '14 at 20:24

nmichaels


It says: urllib2.HTTPError: HTTP Error 403: Forbidden – blackmamba Jan 10 '14 at 20:27
Well there's your problem. The server is denying your script access to that page. – nmichaels Jan 10 '14 at 20:28
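
Some servers answer 403 to clients that announce themselves with the default Python user agent. Whether that is the cause here is not known, so the following is only a sketch under that assumption, written for Python 3. The `USER_AGENT` value and the `build_request` helper are hypothetical names, and nothing guarantees this particular server will accept the request.

```python
import urllib.request

# Hypothetical client name; pick one that identifies your script.
USER_AGENT = "pdf-downloader-example/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that sends an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.altria.com/Documents/Altria_10Q_Filed10242013.pdf")
# Request stores the header name as 'User-agent'.
print(req.get_header("User-agent"))
```

The resulting `req` can be passed to `urllib.request.urlopen` in place of the plain URL.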
