I am using the code below to download PDF files from the web. It works fine for most files.

# -*- coding: utf8 -*-

import urllib2
import shutil
import urlparse
import os

def download(url, fileName=None):
    def getFileName(url,openUrl):
        if 'Content-Disposition' in openUrl.info():
            cd = dict(map(
                lambda x: x.strip().split('=') if '=' in x else (x.strip(),''),
                openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename: return filename
        return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

    r = urllib2.urlopen(urllib2.Request(url))
    try:
        fileName = fileName or getFileName(url,r)
        with open(fileName, 'wb') as f:
            shutil.copyfileobj(r,f)
    finally:
        r.close()

However, for some files with special characters in the URL, for example:

download(u'http://www.poemhunter.com/i/ebooks/pdf/aogán_ó_rathaille_2012_5.pdf', 'c:\\the_file.pdf')

it gives a Unicode error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 21: ordinal not in range(128)

How can I solve this problem? Thanks.

Mark K
  • BTW the `# -*- coding: utf8 -*-` line just tells the Python interpreter to handle UTF-8 in your source code file; it has no effect on how your program itself actually processes UTF-8 / Unicode. – PM 2Ring Dec 15 '14 at 08:55

2 Answers

[I guess this counts as an answer, since it shows an alternative way to handle the URL encoding problem. But I mostly wrote it in response to Mark K's comment in dazedconfused's answer.]

Maybe Acrobat's just being too strict; try another PDF tool.

I just downloaded that PDF using this code in Python 2.6.4 on Puppy Linux (Lupu 5.25):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib
import urlparse

old_URL = u'http://www.poemhunter.com/i/ebooks/pdf/aogán_ó_rathaille_2012_5.pdf'

url_parts = urlparse.urlparse(old_URL)
url_parts = [urllib.quote(s.encode('utf-8')) for s in url_parts]
new_URL = urlparse.urlunparse(url_parts)
print new_URL

urllib.urlretrieve(new_URL, 'test.pdf') 

The PDF file looks ok to me, though my PDF reader, epdfview, complains:

(epdfview:10632): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()

but it seems to display the file ok.

This is what pdfinfo says:

Title:          Aogán à Rathaille - poems - 
Creator:        PoemHunter.Com
Producer:       PoemHunter.Com
CreationDate:   Wed May 23 00:44:47 2012
Tagged:         no
Pages:          7
Encrypted:      yes (print:yes copy:no change:no addNotes:no)
Page size:      612 x 792 pts (letter)
File size:      50469 bytes
Optimized:      no
PDF version:    1.3

I also downloaded it via my browser (Seamonkey 2.31), and as expected it's identical to the file retrieved via Python.
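Note that the code above quotes every component of the URL, scheme and host included; that happens to be harmless for this particular address. A slightly more careful variant, as a minimal sketch only, percent-encodes just the path. The helper name `quote_url_path` is made up for illustration, the try/except import is only there so the snippet runs on both Python 2 and Python 3, and keeping `%` in the safe set is my own choice so that an already-encoded URL is not encoded twice.

```python
# -*- coding: utf-8 -*-
try:  # Python 3
    from urllib.parse import quote, urlsplit, urlunsplit
except ImportError:  # Python 2
    from urllib import quote
    from urlparse import urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode only the path of `url`.

    Hypothetical helper for illustration, not a library function.
    Scheme, host, query and fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    # Encode to UTF-8 bytes first so that Python 2's quote() does not
    # fail on a unicode string; '%' stays safe to avoid double-encoding.
    path = quote(parts.path.encode('utf-8'), safe='/%')
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

For the URL in the question this gives `http://www.poemhunter.com/i/ebooks/pdf/aog%C3%A1n_%C3%B3_rathaille_2012_5.pdf`, which can then be handed to `urllib.urlretrieve` as before.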

PM 2Ring
  • Nice answer. This is the way to retrieve the file. I just checked: if I use `r = urllib2.urlopen(urllib2.Request(url.encode('utf-8')))` to open this URL, `r` is a file object containing plain HTML text instead of binary data, though it works fine on some other URLs. – dazedconfused Dec 15 '14 at 09:39
  • @dazedconfused: Interesting! I haven't noticed that before, but I haven't used `urllib2.Request` myself for a couple of years. You might like to investigate the [requests](http://docs.python-requests.org/en/latest/) package; it makes complicated URL requests much simpler than using the standard modules. Of course, it does have the disadvantage that it's not (yet) a standard module, but it is very popular & easy to install using [pip](https://pip.pypa.io/en/latest/). – PM 2Ring Dec 15 '14 at 10:35
  • Just tried `requests`; it works fine and the PDF file can be opened without any problem after being downloaded; still not sure what causes the problem. – dazedconfused Dec 15 '14 at 10:52
  • @PM 2Ring, marvelous! Thanks for the help. – Mark K Dec 16 '14 at 04:49

You'll have to encode at this line:

r = urllib2.urlopen(urllib2.Request(url.encode('utf-8')))

You need to pass byte strings to Request, so you'll have to do encode().
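To see why the explicit encode matters, here is a small offline sketch. It assumes (the question does not show the full traceback, so this is a guess) that the error comes from the unicode URL being implicitly encoded as ASCII while the HTTP request line is built. The helper `ascii_encode_error` and the `GET ` prefix are illustrative choices of mine, not taken from urllib2.

```python
# -*- coding: utf-8 -*-

def ascii_encode_error(text):
    """Return the message of the UnicodeEncodeError raised by an ASCII
    encode of `text`, or None if `text` is pure ASCII.
    (Hypothetical helper, for illustration only.)"""
    try:
        text.encode('ascii')
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


# Path part of the URL from the question; \xe1 and \xf3 are the two
# accented letters.
path = u'/i/ebooks/pdf/aog\xe1n_\xf3_rathaille_2012_5.pdf'

# Assumed start of the HTTP request line (method, space, path).
request_line = u'GET ' + path

# The implicit ASCII encode fails on the first accented letter.
print(ascii_encode_error(request_line))

# An explicit UTF-8 encode succeeds and yields a byte string.
print(repr(path.encode('utf-8')))
```

The first message reports position 21, the same position as in the question's error, which is at least consistent with that assumption.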

Also, you would probably want to read Python's Unicode HOWTO and How to percent-encode url parameters in python?

dazedconfused
  • Thanks! It works for the download, but when I tried to open the downloaded file, it gives "Acrobat could not open '[name of file]' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded)." – Mark K Dec 15 '14 at 08:20
  • Also see [How to percent-encode url parameters in python?](http://stackoverflow.com/q/1695183/4014959) – PM 2Ring Dec 15 '14 at 08:20