Converting html source content into readable format with Python 2.x

Question

Python 2.7

I have a program that gets video titles from the source code of a webpage but the titles are encoded in some HTML format.

This is what I've tried so far:

>>> import urllib2
>>> urllib2.unquote('&pound;')
'&pound;'

So that didn't work... Then I tried:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('&pound;')
u'\xa3'

As you can see, that doesn't work either, nor does any combination of the two.

I managed to find out that '&pound;' is an HTML character entity name, but I wasn't able to find out what '\xa3' is.

Does anyone know how to convert HTML content into a readable format in Python?

Check out BeautifulSoup. – Joel Cornett Jul 28 '12 at 20:18

score 1 · Answer 1 · answered Jul 28 '12 at 21:13

Why doesn't that work?

In [1]: s = u'\xa3'

In [2]: s
Out[2]: u'\xa3'

In [3]: print s
£

When it comes to unescaping HTML entities, I have always used http://effbot.org/zone/re-sub.htm#unescape-html.
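
A regex-based unescape in that spirit might look like the sketch below. It is not the code from the linked page: the `unescape_entities` helper, the `ENTITY_RE` pattern, and the `to_char` alias are hypothetical names chosen for illustration, and the import fallbacks are only there so the snippet runs on both Python 2.x and Python 3.

```python
import re
import sys

try:
    from html.entities import name2codepoint   # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2.x

# unichr() builds a unicode character on Python 2; chr() does on Python 3.
to_char = chr if sys.version_info[0] >= 3 else unichr

# Matches &#xHH; (hex), &#DDD; (decimal) and &name; (named) entities.
ENTITY_RE = re.compile(r'&(#x[0-9a-fA-F]+|#[0-9]+|\w+);')


def unescape_entities(text):
    """Hypothetical helper: replace HTML entities in a unicode string."""
    def replace(match):
        body = match.group(1)
        try:
            if body.startswith('#x'):
                return to_char(int(body[2:], 16))
            if body.startswith('#'):
                return to_char(int(body[1:]))
        except (ValueError, OverflowError):
            return match.group(0)      # code point out of range: leave as is
        if body in name2codepoint:
            return to_char(name2codepoint[body])
        return match.group(0)          # unknown entity name: leave as is
    return ENTITY_RE.sub(replace, text)


title = unescape_entities(u'&pound;10 &amp; up')
```

Unknown names are left untouched rather than dropped, so malformed input stays visible in the output.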

dav1d

score 1 · Answer 2 · answered Jul 28 '12 at 21:15

The video title strings use HTML entities to encode special characters, such as ampersands and pound signs.

The \xa3 is the Python Unicode character literal for the pound sign (£). In your example, Python is displaying the __repr__() of a Unicode string, which is why you see the escapes. If you print this string, you can see it represents the pound sign:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('&pound;')
u'\xa3'
>>> print h.unescape('&pound;')
£

score 1 · Accepted Answer · edited May 23 '17 at 12:21

&pound; is the HTML character entity for the POUND SIGN, which is Unicode character U+00A3. You can see this if you print it:

>>> print u'\xa3'
£

When you use unescape(), you convert the character entity to its native Unicode character, which is what u'\xa3' means: a single U+00A3 Unicode character.

If you want to encode this into another format (e.g. utf-8), you would do so with the encode method of strings:

>>> u'\xa3'.encode('utf-8')
'\xc2\xa3'

You get a two-byte string representing the single "POUND SIGN" character.
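
To make the character/byte distinction concrete, here is a small sketch; the variable names are illustrative, and the `latin-1` comparison is an addition that is not part of the original answer. It shows that the same single character has different byte lengths in different encodings, and that decoding reverses the step.

```python
pound = u'\xa3'                           # one Unicode character, U+00A3

utf8_bytes = pound.encode('utf-8')        # two bytes: 0xC2 0xA3
latin1_bytes = pound.encode('latin-1')    # one byte: 0xA3

char_count = len(pound)                   # counts characters
utf8_length = len(utf8_bytes)             # counts bytes
latin1_length = len(latin1_bytes)

# Decoding with the matching codec recovers the original character.
round_trip = utf8_bytes.decode('utf-8')
```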

I suspect that you are a bit unclear about how string encodings work in general. You need to convert your string from bytes to unicode (see this answer for one way to do that with urllib2), then unescape the html, then (possibly) convert the unicode into whatever output encoding you need.
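
The three steps above can be sketched roughly as follows. This is a minimal sketch, assuming the page bytes are UTF-8 encoded (in practice the charset comes from the HTTP headers or the page itself). The `decode_title` helper and the `raw_title` bytes are hypothetical, and the import fallback only exists so the snippet also runs on Python 3.

```python
try:
    from html import unescape              # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser      # Python 2.x, as in the question
    unescape = HTMLParser().unescape


def decode_title(raw_bytes, page_encoding='utf-8', output_encoding='utf-8'):
    """Hypothetical helper: bytes -> unicode -> unescaped unicode -> bytes."""
    text = raw_bytes.decode(page_encoding)       # 1. bytes to unicode
    text = unescape(text)                        # 2. resolve entities like &pound;
    return text, text.encode(output_encoding)    # 3. optional output encoding


# Hypothetical title bytes as they might appear in the page source.
raw_title = b'Win &pound;100 &amp; more'
title, encoded = decode_title(raw_title)
```

Keeping the text as unicode for as long as possible and encoding only at the output boundary avoids mixing byte strings and unicode strings midway through the program.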

score 0 · Answer 4 · answered Jul 29 '12 at 20:53

lxml, BeautifulSoup, or PyQuery do the job pretty well, as does a combination of these ;)

starenka
