Python - replace accented characters with non accent

Question

I am parsing some HTML and sometimes get characters like é when I read the data with doc = urllib2.urlopen(url).read(). How can I find and replace these characters with their non-accented equivalents?

The variable doc is a byte string. I have tried to convert it to a unicode string like this:

doc = doc.encode('utf-8')
doc = strip_accents(doc)
doc = doc.decode('utf-8')

Where strip_accents is

import unicodedata

def strip_accents(s):
    # Decompose each character (NFD), then drop the combining marks (category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

This is from the question "What is the best way to remove accents in a Python unicode string?"

But I get this error when I try to encode doc:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa9 in position 161: ordinal not in range(128)

How can I change the accented characters to non-accented ones?

That answer is still correct, but stop working with bytestrings instead of text. — Ignacio Vazquez-Abrams, Nov 27 '15 at 02:37
@IgnacioVazquez-Abrams How can I do that? I'm pretty sure that `urllib2.urlopen(url).read()` returns a byte string? — spen123, Nov 27 '15 at 02:43
I recommend using something that knows what it's doing, such as Requests. Using `urllib2` implies that the programmer understands lower-level concepts and will handle them as required. — Ignacio Vazquez-Abrams, Nov 27 '15 at 02:44
@I how does Requests help remove the accented characters? I just looked it up and it gives me the encoding, so can I now use what I have above and replace the utf-8 encoding with the one it reports? — spen123, Nov 27 '15 at 02:50
It cannot be done on byte strings. You need to figure out the encoding used, and decode that byte string to unicode text. And also please **avoid asking the same question twice** → [Remove accented characters form string - Python](http://stackoverflow.com/questions/33948042/remove-accented-characters-form-string-python), and do not add tags to the title of your question. — roeland, Nov 27 '15 at 04:07
@spenf10 This is clearly a duplicate of your previous question. If you wanted to clarify, you could **[edit](http://stackoverflow.com/posts/33948042/edit)** your question. — Mariano, Nov 27 '15 at 08:13
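
The comments above boil down to "decode the bytes to text first, then strip accents from the text". A minimal sketch of that order of operations follows. It is not taken from the thread: `to_plain_text` and the `sample` bytes are hypothetical stand-ins for the downloaded page, and 'utf-8' is only an assumption about the page's encoding (the real encoding has to come from the HTTP headers or the HTML itself). In Python 2, calling .encode() on a byte string first decodes it implicitly as ASCII, which is the likely source of the UnicodeDecodeError in the question; the explicit ASCII decode below fails in the same way.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks (category 'Mn') after NFD decomposition."""
    decomposed = unicodedata.normalize('NFD', text)
    return u''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


def to_plain_text(raw_bytes, encoding):
    """Hypothetical helper: decode bytes to text, then strip accents."""
    return strip_accents(raw_bytes.decode(encoding))


# Stand-in for urllib2.urlopen(url).read(): the UTF-8 bytes of u'caf\xe9'.
sample = b'caf\xc3\xa9'

# Decoding non-ASCII bytes as ASCII raises UnicodeDecodeError.
try:
    sample.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# 'utf-8' is an assumption about the page, not something known from the thread.
plain = to_plain_text(sample, 'utf-8')
print(plain)
```

If the page turns out to use a different encoding, only the `encoding` argument changes; `strip_accents` always works on already-decoded text.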
