Remove accented characters from string - Python

Question

I get some data from a webpage and read it like this in Python:

    origional_doc = urllib2.urlopen(url).read()

Sometimes the page at this URL contains characters such as é and ä and ect. How can I remove these characters from the string? Right now this is what I am trying:

    import unicodedata
    origional_doc = ''.join((c for c in unicodedata.normalize('NFD', origional_doc) if unicodedata.category(c) != 'Mn'))

But I get an error:

    TypeError: must be unicode, not str

Is `origional_doc` a byte string or a unicode string? – roeland Nov 27 '15 at 00:19
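
A note on the error above: in Python 2, `urllib2.urlopen(url).read()` returns a byte string, while `unicodedata.normalize` expects unicode text, which is what the TypeError reports. Below is a minimal sketch of decoding first and then stripping the combining marks, written for Python 3. The helper name `strip_accents` and the UTF-8 default are assumptions, since the question does not say which encoding the page uses.

```python
import unicodedata

def strip_accents(raw, encoding="utf-8"):
    """Hypothetical helper: decode bytes if needed, then drop combining marks."""
    # Assumption: the page is UTF-8 encoded; pass another encoding if it is not.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category 'Mn' (nonspacing mark) covers the combining accents.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

page_bytes = "\u00e9 and \u00e4".encode("utf-8")  # the bytes for "é and ä"
print(strip_accents(page_bytes))  # e and a
```

Letters that have no decomposition (ø, for example) pass through this unchanged, so it is not a complete transliteration.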

score 0 · Answer 1 · 2015-11-26T23:16:47.587

This should work. It will eliminate all characters that are not ASCII.

    original_doc = (original_doc.decode('unicode_escape').encode('ascii','ignore'))

edited Nov 26 '15 at 23:16

answered Nov 26 '15 at 23:06

What will it replace them with? – spen123 Nov 26 '15 at 23:13
As I said, it would eliminate them, so it would replace them with nothing. If you want to replace them with something specific, use `original_doc = original_doc.replace("é", "e")` – Nov 26 '15 at 23:15
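
To make the difference between eliminating and replacing concrete, here is a small sketch for Python 3. Both function names are made up for illustration. Encoding to ASCII with `'ignore'` drops accented characters outright, whereas decomposing with NFKD first leaves the base letter behind once the non-ASCII combining marks are dropped.

```python
import unicodedata

def drop_non_ascii(text):
    """Hypothetical helper: discard every non-ASCII character."""
    return text.encode("ascii", "ignore").decode("ascii")

def fold_to_ascii(text):
    """Hypothetical helper: decompose first so base letters survive."""
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are non-ASCII, so 'ignore' removes only them.
    return decomposed.encode("ascii", "ignore").decode("ascii")

word = "caf\u00e9"  # "café"
print(drop_non_ascii(word))  # caf
print(fold_to_ascii(word))   # cafe
```

Characters with no decomposition, such as ø or ß, are still dropped by the second function, so a per-character `replace` or a lookup table would be needed for those.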

score -1 · Answer 2 · answered Nov 26 '15 at 22:58

Using `re`, you can remove all characters that fall in a given hexadecimal range, here everything above the 7-bit ASCII range.

    >>> import re
    >>> re.sub('[\x80-\xFF]','','é and ä and ect')
    ' and  and ect'

You can also do the inverse and remove anything that's NOT among the basic 128 ASCII characters:

    >>> re.sub('[^\x00-\x7F]','','é and ä and ect')
    ' and  and ect'
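
One caveat, sketched under the assumption that the text has already been decoded to unicode (a Python 3 `str`): the two patterns are then no longer equivalent. The range `\x80-\xFF` only covers code points up to U+00FF, so characters beyond it survive, while the negated class removes everything outside ASCII. The sample string is made up for illustration.

```python
import re

text = "\u00e9 and \u0142"  # "é and ł"; ł is U+0142, beyond U+00FF

# Range class: removes only code points U+0080..U+00FF, so ł survives.
latin1_removed = re.sub(r"[\x80-\xff]", "", text)

# Negated class: removes every code point outside 7-bit ASCII.
ascii_only = re.sub(r"[^\x00-\x7f]", "", text)

print(repr(latin1_removed))  # ' and ł'
print(repr(ascii_only))      # ' and '
```

On byte strings, as in the Python 2 session above, every byte of a multi-byte UTF-8 character is in the range 0x80 to 0xFF, which is why both patterns give the same result there.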

R Nar

Could you show an example with a string? And I use 'ect' to mean et cetera. I want it to replace all of these with their closest equivalent. – spen123 Nov 26 '15 at 23:14
I know you did, I just copied the text from your question. And this is a string, so I don't know what you mean by "show an example with a string". – R Nar Nov 26 '15 at 23:16
In terms of replacing them with their non-accented equivalent (which, might I add, strays from your original question), check [this question](http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) – R Nar Nov 26 '15 at 23:17
