Encoding/decoding Unicode and UTF-8 in Python

Question

I have some HTML text: If I&#039;m reading lots of articles

I am trying to replace &#039; and other such special characters with the corresponding Unicode characters, such as '. I tried

rawtxt.encode('utf-8').encode('ascii','ignore')

but it fails:

Error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2

It looks like this is not really the code that produces the error because the error comes from trying to decode the string as ascii. Where does rawtxt come from? — Sarien, May 16 '13 at 12:01
@Sarien: it is the code that produces the error. You can get a decode error in a call to `encode`. See: http://chat.stackoverflow.com/rooms/10/conversation/python2-decode-error-when-encoding — R. Martinho Fernandes, May 16 '13 at 13:04
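
To illustrate the point in the comment above: in Python 2, calling `encode` on a byte string (`str`) first decodes it implicitly with the ASCII codec, and that hidden step is what raises the error. Below is a minimal sketch for Python 3, where the decode has to be written out explicitly. The sample text is an assumption, since the question does not show what rawtxt actually contains; it uses a right single quotation mark (U+2019), whose UTF-8 encoding starts with byte 0xe2. The helper name `implicit_decode_then_encode` is hypothetical.

```python
# Assumed sample: the question does not show the real contents of rawtxt.
# U+2019 (right single quotation mark) is encoded in UTF-8 as E2 80 99.
raw_bytes = u"I\u2019m reading".encode("utf-8")


def implicit_decode_then_encode(data):
    """Mimic Python 2's str.encode(): decode as ASCII first, then encode."""
    text = data.decode("ascii")  # this hidden step is the one that fails
    return text.encode("utf-8")


try:
    implicit_decode_then_encode(raw_bytes)
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Decoding with the actual encoding first avoids the error.
text = raw_bytes.decode("utf-8")
ascii_only = text.encode("ascii", "ignore")  # silently drops the non-ASCII quote
print(ascii_only)
```

Note that `encode('ascii', 'ignore')` discards non-ASCII characters instead of converting them, so it would not turn an entity or a curly quote into a plain apostrophe anyway.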

score 3 · Accepted Answer · answered May 16 '13 at 11:54

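You're having problems with HTML entities, not Unicode or UTF-8: &#039; is an HTML numeric character reference for the apostrophe, so it needs to be unescaped rather than re-encoded. Try this: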

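# Python 2 only: uses the HTMLParser module and the print statement
import HTMLParser

h = HTMLParser.HTMLParser()
# unescape() converts HTML entities such as &#039; into the characters they stand for
s = h.unescape('If I&#039;m reading lots of articles')
print s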

This prints If I'm reading lots of articles.
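
The snippet above targets Python 2. Here is a minimal sketch of the same idea for Python 3.4 or later, assuming the input is already a text string; the standard-library function `html.unescape` plays the role of `HTMLParser.unescape`:

```python
import html

# html.unescape() replaces named and numeric character references
# (for example &#039; or &amp;) with the characters they represent.
s = html.unescape('If I&#039;m reading lots of articles')
print(s)
```

If the text arrives as bytes (for example, read from a socket or a file opened in binary mode), it would need to be decoded with its actual encoding before unescaping.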

likeitlikeit
