I am parsing some HTML, and sometimes I get characters like é when I read the data with doc = urllib2.urlopen(url).read(). How can I find these characters and replace them with their non-accented equivalents?

The variable doc is a byte string. I have tried to convert it to a unicode string like this:

doc = doc.encode('utf-8')
doc = strip_accents(doc)
doc = doc.decode('utf-8')

where strip_accents is:

import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

This is taken from the question "What is the best way to remove accents in a Python unicode string?"

But I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa9 in position 161: ordinal not in range(128)

This happens when I try to encode doc.

How can I change the accented characters to non-accented ones?

spen123
  • That answer is still correct, but stop working with bytestrings instead of text. – Ignacio Vazquez-Abrams Nov 27 '15 at 02:37
  • @IgnacioVazquez-Abrams How can I do that? I'm pretty sure that `urllib2.urlopen(url).read()` returns a byte string? – spen123 Nov 27 '15 at 02:43
  • I recommend using something that knows what it's doing, such as Requests. Using `urllib2` implies that the programmer understands lower-level concepts and will handle them as required. – Ignacio Vazquez-Abrams Nov 27 '15 at 02:44
  • @I How does Requests help remove the accented characters? I just looked it up and it gives me the encoding, so can I now use what I have above and replace the utf-8 encoding with the one it reports? – spen123 Nov 27 '15 at 02:50
  • @IgnacioVazquez-Abrams – spen123 Nov 27 '15 at 02:55
  • The `text` attribute gives you actual text, if possible. – Ignacio Vazquez-Abrams Nov 27 '15 at 02:57
  • It cannot be done on byte strings. You need to figure out the encoding used, and decode that byte string to unicode text. And also please **avoid asking the same question twice** → [Remove accented characters form string - Python](http://stackoverflow.com/questions/33948042/remove-accented-characters-form-string-python), and do not add tags to the title of your question. – roeland Nov 27 '15 at 04:07
  • @spenf10 This is clearly a duplicate of your previous question. If you wanted to clarify, you could **[edit](http://stackoverflow.com/posts/33948042/edit)** your question. – Mariano Nov 27 '15 at 08:13
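
The advice in the comments (decode the bytes to text first, then strip the accents) can be sketched roughly as follows. In Python 2, calling `.encode('utf-8')` on a byte string makes the interpreter decode it as ASCII first, which is where the `UnicodeDecodeError` on byte 0xa9 comes from; 0xa9 is also the second byte of é in UTF-8, and those two bytes read as Latin-1 display as é. The helper name `bytes_to_plain_text` and the default `utf-8` encoding are assumptions for illustration only; the real encoding should come from the response headers or the page itself.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks (category 'Mn') from already-decoded text."""
    decomposed = unicodedata.normalize('NFD', text)
    return u''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


def bytes_to_plain_text(raw_bytes, encoding='utf-8'):
    """Hypothetical helper: decode the bytes first, then strip accents.

    The 'utf-8' default is an assumption; pass the encoding the server
    actually reports if it differs.
    """
    text = raw_bytes.decode(encoding)  # bytes -> unicode text
    return strip_accents(text)


# Stand-in for the result of urllib2.urlopen(url).read(): the UTF-8 bytes
# of u'caf\xe9' ("café").
raw = b'caf\xc3\xa9'

# Decoding the same bytes as Latin-1 yields the two-character mojibake
# seen in the question instead of a single accented letter.
mojibake = raw.decode('latin-1')

print(bytes_to_plain_text(raw))  # cafe
```

With Requests, as suggested above, the `text` attribute of the response already holds decoded text, so only the `strip_accents` step would be needed.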

0 Answers