Python - HTML to Unicode

Question

I have a Python script that fetches some HTML and parses it with Beautiful Soup. Sometimes the HTML contains non-Unicode characters, which cause errors in my script and in the file I am creating.

Here is how I am getting the HTML

html = urllib2.urlopen(url).read().replace('&nbsp;',"")
xml = etree.HTML(html)

When I use this

html = urllib2.urlopen(url).read().encode('ascii', 'xmlcharrefreplace')

I get a UnicodeDecodeError.

How can I convert this to Unicode, so that my code won't break if there are non-Unicode characters?

possible duplicate of [Convert HTML entities to Unicode and vice versa](http://stackoverflow.com/questions/701704/convert-html-entities-to-unicode-and-vice-versa) — anon582847382, Nov 03 '14 at 20:58
@AlexThornton when I use that I get a UnicodeDecodeError — iqueqiorio, Nov 03 '14 at 21:00
Could you give a small example of an input string and the output you would expect? — anon582847382, Nov 03 '14 at 21:15

bobince · Answer 1 · 2014-11-04T09:51:54.840

When I use this

html = urllib2.urlopen(url).read().encode('ascii', 'xmlcharrefreplace')

I get an error UnicodeDecodeError. How could I change this into unicode.

unicode characters -> bytes = ‘encode’
bytes -> unicode characters = ‘decode’

You have bytes and you want Unicode characters, so the method you need is decode. Because you called encode, Python assumes you want to go from characters to bytes, so it first tries to convert your bytes to characters so that they can be turned back into bytes! It uses the default encoding for this implicit step, which in your case is ASCII, so it fails on non-ASCII bytes.
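
As a minimal sketch of the two directions, assuming the page is UTF-8 encoded (the sample string stands in for what urllib2.urlopen(url).read() returns, and the variable names are only illustrative):

```python
# Stand-in for the bytes returned by urllib2.urlopen(url).read().
# Assumption: the page is encoded as UTF-8.
raw = u"caf\u00e9 &amp; cr\u00e8me".encode("utf-8")

# bytes -> unicode characters: decode, naming the page's encoding
text = raw.decode("utf-8")

# unicode characters -> bytes: encode; characters outside ASCII
# are written as numeric character references
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
```

In Python 2, calling raw.encode('ascii', 'xmlcharrefreplace') directly on the bytes performs an implicit raw.decode('ascii') first, and that hidden step is what raises the UnicodeDecodeError.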

However, it is unclear why you want to do this. etree parses bytes as-is. If you want to remove the character U+00A0 Non Breaking Space from your data, you should do that on the extracted content you get after HTML parsing, rather than trying to grapple with the HTML source. HTML markup might include U+00A0 as raw bytes, incorrectly-unterminated entity references, numeric character references and so on. Let the HTML parser handle that for you; it's what it's good at.
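
To illustrate stripping U+00A0 after parsing rather than before, here is a sketch that uses Python 3's standard-library html.parser as a stand-in for lxml's etree. TextCollector is a hypothetical helper and the sample markup is made up:

```python
from html.parser import HTMLParser  # Python 3 standard library


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes once the parser
    has resolved entity and character references."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# The same character written three ways: raw, as a named entity
# missing its semicolon, and as a numeric character reference.
source = u"<p>a\u00a0b&nbspc&#160;d</p>"

collector = TextCollector()
collector.feed(source)
collector.close()

# After parsing, every variant is the single character U+00A0,
# so one replace on the extracted text covers all of them.
text = u"".join(collector.chunks)
cleaned = text.replace(u"\u00a0", u"")
```

A replace('&nbsp;', "") on the raw source would have caught none of these three spellings, which is the reason for doing the cleanup on the parsed text.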

score 0 · Answer 2 · answered Nov 10 '14 at 08:12

If you feed HTML to BeautifulSoup, it will decode it to Unicode. If the charset declaration is wrong or missing, or parts of the document are encoded differently, this might fail; BeautifulSoup ships with a special module, dammit (bs4.dammit), which might help you with these documents.

Since you mention BeautifulSoup, why don't you do it like this:

import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen(url).read())

and work with the soup? BTW, all HTML entities will be resolved to unicode characters.

The ASCII character set is very limited and may lack many of the characters in your document. I'd use UTF-8 instead whenever possible.
