Removing accents and special characters

Question

Possible Duplicate:
What is the best way to remove accents in a python unicode string?
Python and character normalization

I would like to remove accents, turn all characters to lowercase, and delete any numbers and special characters.

Example:

Frédér8ic@ --> frederic

Proposal:

import unicodedata

def remove_accents(data):
    # Decompose accented characters, keep only letters (category L*), then lowercase.
    return ''.join(x for x in unicodedata.normalize('NFKD', data)
                   if unicodedata.category(x)[0] == 'L').lower()

Is there any better way to do this?

Could you edit your answer to include some examples of desired input and output? — Christian Neverdal, Jan 01 '12 at 18:56
@Christian Jonassen Frédér8ic@ --> frederic @@àbcd --> abcd %*tréçd --> trecd — Fred, Jan 01 '12 at 19:00
This is possibly not a duplicate considering OP wanted something more than unicode normalization. — Abhijit, Jan 01 '12 at 19:45

score 15 · Accepted Answer · edited Mar 13 '18 at 09:58

A possible solution would be

import string
import unicodedata
def remove_accents(data):
    return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()

As far as I know, NFKD is the standard way to normalize Unicode text into its compatibility decomposition. To remove what is left (special characters, digits, and the combining marks produced by normalization), you can simply compare each character against string.ascii_letters and drop any character that is not in that set.
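
To make the role of NFKD concrete, here is a minimal sketch (assuming Python 3) of what normalization does to an accented character and why the `string.ascii_letters` filter then discards accents, digits, and symbols in one pass. The helper name `strip_to_ascii_letters` is hypothetical and used only for this illustration.

```python
import string
import unicodedata

# NFKD splits U+00E9 (e with acute) into 'e' plus U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize('NFKD', '\u00e9')
print([hex(ord(ch)) for ch in decomposed])   # ['0x65', '0x301']
print(unicodedata.category(decomposed[1]))   # Mn (nonspacing mark)

def strip_to_ascii_letters(text):
    """Hypothetical helper: decompose, keep only ASCII letters, lowercase."""
    decomposed_text = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed_text
                   if ch in string.ascii_letters).lower()

# Inputs taken from the question's comments.
for sample in ('Frédér8ic@', '@@àbcd', '%*tréçd'):
    print(ascii(sample), '-->', strip_to_ascii_letters(sample))
```

Note that letters without an ASCII decomposition (for example `ø` or `ß`) are dropped entirely by this filter rather than transliterated.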

edited Mar 13 '18 at 09:58

heemayl

answered Jan 01 '12 at 19:41

Abhijit


But what is the `string` variable in that command, where you refer to `if x in string.ascii_letters`? – Falcoa Jan 17 '17 at 12:33
@Falcoa is right. There's another solution: `def remove_accents(self, data): return unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')` – lesimoes Apr 19 '17 at 15:32
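
The encode-based variant from the comment above can be sketched as follows, assuming Python 3. Two caveats: `encode()` returns `bytes` there, so a `decode()` is needed to get a `str` back, and the approach on its own keeps digits and punctuation, so a further filter is needed to meet the question's requirements. The function names `ascii_fold` and `letters_only` are hypothetical.

```python
import re
import unicodedata

def ascii_fold(text):
    """Drop accents by decomposing, then discarding every non-ASCII code point."""
    decomposed = unicodedata.normalize('NFKD', text)
    # encode() yields bytes in Python 3, so decode back to str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

def letters_only(text):
    """Fold to ASCII, lowercase, and keep only the letters a-z."""
    return re.sub(r'[^a-z]', '', ascii_fold(text).lower())

print(ascii_fold('Frédér8ic@'))    # Freder8ic@
print(letters_only('Frédér8ic@'))  # frederic
```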

score 1 · Answer 2 · edited May 23 '17 at 12:08

Can you convert the string into HTML entities? If so, you can then use a simple regular expression.

The following replacement would work in PHP/PCRE (see my other answer for an example):

'~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i' => '$1'

Then simply convert back from HTML entities and remove any character that is not a letter a-z or A-Z (demo @ CodePad).

Sorry, I don't know Python well enough to provide a Pythonic answer.
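
For reference, a rough Python 3 rendering of the entity-based idea might look like the sketch below. It is an assumption about how the steps map onto the standard library (`html.entities.codepoint2name` and `html.unescape`), not a translation of the PHP demo, and the function name `remove_accents_via_entities` is hypothetical.

```python
import html
import re
from html.entities import codepoint2name

# Same pattern as the PCRE above: capture the base letter(s) of an accent entity.
ENTITY_RE = re.compile(
    r'&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);',
    re.IGNORECASE,
)

def remove_accents_via_entities(text):
    # 1. Replace non-ASCII characters that have a named entity, e.g. &eacute;.
    encoded = ''.join(
        '&%s;' % codepoint2name[ord(ch)]
        if ord(ch) > 127 and ord(ch) in codepoint2name else ch
        for ch in text
    )
    # 2. Reduce each accent entity to its base letter(s): &eacute; becomes e.
    stripped = ENTITY_RE.sub(r'\1', encoded)
    # 3. Convert remaining entities back, lowercase, and drop non-letters.
    return re.sub(r'[^a-z]', '', html.unescape(stripped).lower())

print(remove_accents_via_entities('Frédér8ic@'))  # frederic
```

This sketch does not guard against input that already contains literal entity text such as `&eacute;`, which step 2 would also rewrite.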

edited May 23 '17 at 12:08

Community

answered Jan 01 '12 at 19:13

Alix Axel


I'm not sure that regexes are more efficient than unicodedata – Fred Jan 01 '12 at 19:36
@user1125315: I'm not sure either, but it correctly passes your input/output tests. Feel free to try other approaches though, the `unidecode` lib seems awesome. – Alix Axel Jan 01 '12 at 19:41
