
I have some HTML text: If I&#39;m reading lots of articles

I am trying to replace &#39; and other such HTML entities with the corresponding Unicode characters (here, '). I tried

rawtxt.encode('utf-8').encode('ascii','ignore') 

but it fails with:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2

Harshit
  • It looks like this is not really the code that produces the error because the error comes from trying to decode the string as ascii. Where does rawtxt come from? – Sarien May 16 '13 at 12:01
  • @Sarien: it is the code that produces the error. You can get a decode error in a call to `encode`. See: http://chat.stackoverflow.com/rooms/10/conversation/python2-decode-error-when-encoding – R. Martinho Fernandes May 16 '13 at 13:04
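To illustrate the point made in the comments: in Python 2, calling `encode` on a byte string first decodes it implicitly with the ASCII codec, and that hidden step is what raises the decode error. The sketch below performs that decode explicitly so the same message appears. The sample bytes are an assumption (a UTF-8 right single quotation mark, which starts with 0xe2), since the actual contents of `rawtxt` are not shown in the question.

```python
# Assumed sample data: the real rawtxt is not shown in the question.
# U+2019 (right single quotation mark) encodes to the bytes e2 80 99 in UTF-8.
raw = u"If I\u2019m reading".encode("utf-8")

try:
    # Python 2 does this decode implicitly when encode() is called on a byte string.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```

If this is what is happening, decoding the bytes with the encoding they were actually written in (for example `raw.decode("utf-8")`) before any further processing avoids the implicit ASCII step.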

1 Answer


You're having problems with HTML entities, not unicode or UTF-8. Try this:

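import HTMLParser  # Python 2 module; renamed html.parser in Python 3

h = HTMLParser.HTMLParser()
# The input contains the entity &#39;, which unescape() turns into an apostrophe.
s = h.unescape('If I&#39;m reading lots of articles')
print s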

This prints If I'm reading lots of articles.
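For readers on Python 3.4 or later, the standard library offers `html.unescape`, which should do the same job without creating a parser object. This is a minimal sketch, assuming the entity in the text is `&#39;`; the variable names are only illustrative.

```python
import html

# Assumed input: the sentence from the question with the apostrophe written as &#39;.
text = "If I&#39;m reading lots of articles"

# html.unescape converts named and numeric character references to Unicode characters.
decoded = html.unescape(text)
print(decoded)
```

The result is an ordinary `str`, so it can be encoded to UTF-8 afterwards if bytes are needed.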

likeitlikeit