urllib2 encoding issue

Question

This is my example script:

import urllib2, re

response = urllib2.urlopen('http://domain.tld/file')
data     = response.read() # Normally displays "the emoticon <3 is blah blah"

pattern   = re.search('(the emoticon )(.*)( is blah blah)', data)
result    = pattern.group(2) # result should contain "<3" now

print 'The result is ' + result # prints "&lt;3" because the HTML entity is not decoded

As you can see, I am fetching a page and trying to extract a string from it, but the result still contains the HTML entity instead of the actual character. I am not sure what to add to this script to make the end result correct. Could anyone point out what I am doing wrong?

You might want to take a look at [this question](http://stackoverflow.com/questions/1208916/decoding-html-entities-with-python). — Gareth Latty, May 12 '12 at 03:09
@Lattyware Looked, didn't see much help as I'd rather not use an external module for this. — Markum, May 12 '12 at 03:20

score 1 · Accepted Answer · answered May 12 '12 at 05:29

Try this (HTMLParser is in the Python 2 standard library, so no external module is needed):

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('wer&amp;wer')
u'wer&wer'
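
Applied to the script from the question, this could look roughly like the sketch below. The `data` string is a stand-in for `response.read()` (the real page content is an assumption), and the `try`/`except` import is only there so the same snippet also runs on Python 3, where `html.unescape` provides the same decoding.

```python
import re

try:
    # Python 3.4+: html.unescape is part of the standard library
    from html import unescape
except ImportError:
    # Python 2: fall back to the HTMLParser method shown above
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# Stand-in for response.read(); the actual page content is assumed
data = 'the emoticon &lt;3 is blah blah'

# Capture the text between the two fixed phrases
match = re.search(r'the emoticon (.*) is blah blah', data)

# Decode HTML entities such as &lt; and &amp; in the captured text
result = unescape(match.group(1)) if match else None

print('The result is ' + str(result))
```

The entity is decoded only after the regex match, so the pattern still sees the raw page text exactly as the server sent it.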

lenik
