python - how to convert html string to utf-8? Getting UnicodeDecodeError errors

Question

I have a script that loops through a database, runs some BeautifulSoup processing on each string, and replaces some text with other text.

This works most of the time, but some HTML blobs seem to contain Unicode text, which breaks the script with the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 112: ordinal not in range(128)

I'm not sure what to do in this case. Does anyone know of a module or function that forces all text in a string into a standardized encoding such as UTF-8?

All the html blobs in the database came from feedparser (downloading rss feeds, storing in db).

Do you know which encoding was used? If not, you have to guess it, convert to unicode, and re-save the data as UTF-8. BeautifulSoup is usually good at guessing encodings, but you may also try [`chardet`](http://pypi.python.org/pypi/chardet). — Bakuriu, Jan 12 '13 at 13:18
It's difficult to help without seeing the script that produces the error. — Fredrick Brennan, Jan 12 '13 at 13:25
@Amyth - I did try .encode() and also .decode().encode() to no success unfortunately. — Joe, Jan 12 '13 at 13:25
@Bakuriu - was hoping to avoid having to add overhead of charset detection which isn't guaranteed 100% — Joe, Jan 12 '13 at 13:29
@Bakuriu based on the small amount of actual info in the post, it is likely to be UTF-8 with `0xe2` being a common UTF-8 lead byte since it's for smart quotes and such — Esailija, Jan 12 '13 at 13:31
@Daniel There's really no point in showing the code. All it is is a db command to retrieve the record from the database, about 50 lines of BeautifulSoup that return a string variable, and a few str.replace('','') lines. The issue isn't with the code, as it works fine; the issue is that some records in the database contain unicode characters, and I'm looking for a fix for that. — Joe, Jan 12 '13 at 13:32
Make sure you have read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets": http://www.joelonsoftware.com/articles/Unicode.html. It really helped me deal with unicode errors. — Jakub M., Jan 12 '13 at 13:34
Unless you are calling `.decode("ascii")` somewhere, you really need to show the code. — Esailija, Jan 12 '13 at 13:37

score 2 · Accepted Answer · answered Jan 12 '13 at 17:07

Before you do any further processing with your string variable:

clean_str = unicode(str_var_with_strange_coding, errors='ignore')

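Because no encoding is given, the default codec (ASCII, unless you changed it) is used, and any bytes it cannot decode are skipped. Not elegant, since you make no attempt to recover characters that may be meaningful, but effective.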
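To make the trade-off concrete, here is a minimal sketch of what errors='ignore' does. It uses bytes.decode rather than unicode() so that it also runs on Python 3, where unicode() no longer exists; the sample bytes are made up for illustration.

```python
# Hypothetical sample: UTF-8 bytes containing curly quotes (lead byte 0xe2).
raw = b'He said \xe2\x80\x9chi\xe2\x80\x9d'

# Decoding as ASCII with errors='ignore' silently drops every non-ASCII byte.
ignored = raw.decode('ascii', errors='ignore')

# Decoding with the codec the data was actually written in keeps the quotes.
decoded = raw.decode('utf-8')

print(repr(ignored))
print(repr(decoded))
```

The first result has lost both quote characters, which is the data loss described above; the second keeps them, but only works if you know (or correctly guess) the real encoding.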

score 1 · Answer 2 · answered Jan 12 '13 at 13:50

Since you don't want to show us your code, I'm going to give a general answer that hopefully helps you find the problem.

When you first get the data out of the database and fetch it with fetchone, you need to convert it into a unicode object. It is good practice to do this as soon as you have your variable, and then re-encode it only when you output it.

import MySQLdb

db = MySQLdb.connect()  # connection parameters omitted
cur = db.cursor()
cur.execute("SELECT col FROM the_table LIMIT 10")
# Decode with whatever encoding the text is in; we're pretty sure it's UTF-8.
# You might use chardet if you need to guess.
xml = cur.fetchone()[0].decode('utf-8')

After you run xml through BeautifulSoup, you might encode the string again if it is being saved into a file or you might just leave it as a Unicode object if you are re-inserting it into the database.

score 1 · Answer 3 · edited May 23 '17 at 12:04

Make sure you really understand the difference between Unicode and UTF-8, and that they are not the same thing (which surprises many people). That is covered in The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

What is the encoding of your DB? Is it really UTF-8, or do you only assume that it is? If it contains blobs with random encodings, then you have a problem, because you cannot reliably guess the encoding. When you read from the database, decode the blob to unicode and use unicode from then on in your code.
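If the blobs really are mixed, one common workaround is to try UTF-8 first and fall back to a single-byte codec. The sketch below is only an illustration: the helper name to_text and the choice of cp1252 as the fallback are assumptions, not something taken from the question, and a wrong fallback yields wrong characters rather than an error.

```python
def to_text(blob, fallback='cp1252'):
    """Decode bytes as UTF-8, falling back to an assumed single-byte codec."""
    try:
        return blob.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: use the fallback and replace bytes it cannot map
        # with U+FFFD instead of raising.
        return blob.decode(fallback, errors='replace')


print(repr(to_text(b'caf\xc3\xa9')))  # valid UTF-8
print(repr(to_text(b'caf\xe9')))      # invalid UTF-8, decoded as cp1252
```

Strict UTF-8 decoding rarely succeeds by accident on text in another encoding, which is why it is tried first.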

But let's assume your database is UTF-8. Then you should follow the rule "decode early, encode late": use unicode everywhere inside your program, and only decode or encode when you read from or write to the database, the display, a file, etc.
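A minimal sketch of that pattern, assuming the data is UTF-8; the sample bytes and the replacement are hypothetical stand-ins for a database row and the real processing:

```python
# Hypothetical bytes as they might come back from the database (UTF-8 assumed).
raw = b'<p>\xe2\x80\x9cOld\xe2\x80\x9d title</p>'

text = raw.decode('utf-8')           # decode once, at the input boundary
text = text.replace(u'Old', u'New')  # all processing happens on unicode text
out = text.encode('utf-8')           # encode once, at the output boundary

print(repr(out))
```

Everything between the decode and the encode works on text only, so no implicit conversion can happen in the middle.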

Unicode and encodings are a bit of a pain in Python 2.x. Fortunately, in Python 3 all text is unicode.

Regarding BeautifulSoup, use the latest version 4.

score 1 · Answer 4 · answered Jan 12 '13 at 16:52

Well after a couple more hours googling, I finally came across a solution that eliminated all decode errors. I'm still fairly new to python (heavy php background) and didn't understand character encoding.

In my code I had a .decode('utf-8'), and after that I ran some .replace(str(beatiful_soup_tag),'') statements. The solution ended up being as simple as changing every str() to unicode(). After that, not a single issue.
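For anyone hitting the same thing: in Python 2, combining a byte string (what str() returns) with a unicode object makes Python decode the byte string implicitly with the ASCII codec, and that hidden step is what raises the error. The sketch below makes the step explicit so that it also runs on Python 3; the sample bytes are hypothetical.

```python
# Hypothetical sample: UTF-8 bytes for a tag containing a curly quote.
tag_bytes = b'<b>\xe2\x80\x9cquote</b>'

try:
    # The decode that Python 2 performs implicitly when bytes meet unicode.
    tag_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Working with text decoded by the right codec avoids the implicit ASCII step.
tag_text = tag_bytes.decode('utf-8')
print(repr(tag_text))
```

Using unicode() on the tag keeps both sides of the replace as text, so the implicit ASCII decode never happens.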

Answer found on: http://ubuntuforums.org/showthread.php?t=1212933

I sincerely apologize to the commenters who asked me to post the code. What I thought was rock solid and not the issue turned out to be quite the opposite, and I'm sure they would have caught it right away! I'll not make that mistake again! :)

Glad it works for you now. Nevertheless, if your input data comes from arbitrary internet pages, you can expect the next error, as some pages deliver mixed encodings. A well-known case is currency signs in ISO 8859 encoding in an otherwise valid UTF-8 page. If you run into these errors, remember the errors='ignore' flag when you convert a string to unicode. — Michael, Jan 12 '13 at 17:14
