I am parsing some HTML, and sometimes I get characters like é when I read the data with
doc = urllib2.urlopen(url).read()
How can I find these characters and replace them with their non-accented equivalents?
The variable doc is a byte string. I have tried to convert it to a unicode string like this:
doc = doc.encode('utf-8')
doc = strip_accents(doc)
doc = doc.decode('utf-8')
Here strip_accents is defined as:
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
I took it from the question "What is the best way to remove accents in a Python unicode string?"
But I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa9 in position 161: ordinal not in range(128)
It is raised when I try to encode doc.
How can I change the accented characters to non-accented ones?
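For reference, here is a minimal sketch of the decode-first ordering, since calling encode on a byte string in Python 2 triggers an implicit ASCII decode, which is where a UnicodeDecodeError like the one above comes from. It assumes the page is UTF-8 encoded (an assumption; the real encoding should be taken from the response headers or the HTML meta tag), and the sample bytes in raw are a hypothetical stand-in for the result of urllib2.urlopen(url).read().

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks (category 'Mn') from a unicode string."""
    decomposed = unicodedata.normalize('NFD', text)
    return u''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


# Hypothetical stand-in for urllib2.urlopen(url).read():
# the UTF-8 bytes for "cafe" with an acute accent on the e.
raw = b'caf\xc3\xa9'

# Assumption: the document is UTF-8. Decode bytes -> unicode first.
text = raw.decode('utf-8')

# strip_accents works on the unicode string, not on the raw bytes.
plain = strip_accents(text)

# Encode back to bytes only if a byte string is needed afterwards.
plain_bytes = plain.encode('utf-8')
```

The order is the reverse of the attempt above: decode turns bytes into unicode, strip_accents runs on the unicode text, and encode is only needed at the end if bytes are required again.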