I am trying to use HTMLParser and urllib2 to parse a page and download an image file it links to:

content = urllib2.urlopen( imgurl.encode('utf-8') ).read()
try:
    p = MyHTMLParser(  )
    p.feed( content )
    p.download_file( )
    p.close()
except Exception,e:
    print e

MyHTMLParser:

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)        
        self.url=""
        self.outfile = "some.png"

    def download_file(self):
        urllib.urlretrieve( self.url, self.outfile )

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # after some manipulation here, self.url will have a img url
            self.url = "http://somewhere.com/Fondue%C3%A0.png"

When I run the script, I get:

Traceback (most recent call last):
  File "test.py", line 59, in <module>
    p.feed( data )
  File "/usr/lib/python2.7/HTMLParser.py", line 114, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 158, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 305, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 56: ordinal not in range(128)

Following suggestions I found online, I added the .encode('utf-8') call, but I still get the error. How can I fix this? Thanks.

dorothy
  • That error message should come with a file name and line number pointing you to which line is giving the error. Which line is it? – Sam Mussmann Dec 20 '13 at 19:03
  • @SamMussmann, I added the actual traceback. Thanks. – dorothy Dec 20 '13 at 23:44
  • HTML parsers parse HTML. That's an actual image. Why don't you just download the file? – Blender Dec 20 '13 at 23:47
  • @Blender, actually the image URL is only known after I do some string manipulation at runtime, so I don't know the exact URL at the start. – dorothy Dec 20 '13 at 23:50
  • @dorothy: But the problem remains: is `url` pointing to an HTML page (like [this](http://en.wikipedia.org/wiki/File:Greatest_common_divisor_chart.png)), or is it the URL for a PNG image (like [this](http://upload.wikimedia.org/wikipedia/commons/a/ad/Greatest_common_divisor_chart.png))? – Blender Dec 20 '13 at 23:54
  • @Blender, yes, self.url will eventually contain the PNG image link. I have to parse the HTML page to get that PNG link and then store it in self.url. – dorothy Dec 21 '13 at 00:00
  • Here's an [example of how the `Content-type` header is used to get the character encoding and HTMLParser is used to get city names](http://stackoverflow.com/a/13517891/4279). – jfs Dec 21 '13 at 01:28

1 Answer

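Replace

content = urllib2.urlopen( imgurl.encode('utf-8') ).read()

with

content = urllib2.urlopen(imgurl).read().decode('utf-8')

This decodes the response into unicode before it reaches the parser. The .encode('utf-8') call in the original only affects the URL string, not the bytes returned by read(), so the parser was still being fed undecoded bytes.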
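
As a rough illustration of decoding before parsing, here is a minimal sketch that runs without network access. The names LinkCollector, collect_links and sample are hypothetical and not from the question, and the byte string stands in for the result of urllib2.urlopen(...).read(). The imports are written to work on both Python 2 and Python 3. The traceback above suggests the failure happens when Python 2's HTMLParser unescapes an entity inside an attribute value that also contains non-ASCII bytes, so the sample includes both.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as in the question


class LinkCollector(HTMLParser):
    """Hypothetical parser that records the href of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(raw_bytes, encoding="utf-8"):
    """Decode the response body to text first, then feed it to the parser."""
    parser = LinkCollector()
    parser.feed(raw_bytes.decode(encoding))
    parser.close()
    return parser.links


# Stand-in for the downloaded page: UTF-8 bytes whose href contains both a
# non-ASCII character (U+00E0) and an entity (&amp;).
sample = u'<a href="/img?name=Fondue\u00e0&amp;size=1">pic</a>'.encode("utf-8")
links = collect_links(sample)
```

The encoding argument defaults to UTF-8 only as an assumption; the page may declare a different one.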

Matt Williamson
  • @dorothy: utf-8 is *not* the only character encoding that may be used in an html document. It may be specified in the http header (e.g., `Content-type: text/html; charset=utf-8`), inside the html itself (e.g., in a `<meta>` tag), [etc](http://en.wikipedia.org/wiki/Character_encodings_in_HTML) – jfs Dec 21 '13 at 01:14
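
Following up on the comment above, one possible way to read the charset out of a `Content-type` header value is sketched below. The helper name charset_from_content_type and the UTF-8 fallback are assumptions for illustration; the parsing itself relies on the standard library's email.message.Message, which is available on both Python 2.7 and Python 3.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-type header value.

    Falls back to `default` (an assumption, not a guarantee about the
    page) when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset() or default


encoding = charset_from_content_type("text/html; charset=ISO-8859-1")
```

The resulting name can then be passed to .decode() in place of a hard-coded 'utf-8'. This only covers the HTTP header; a charset declared inside the HTML itself would need separate handling.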