10

Possible Duplicate:
What is the best way to remove accents in a python unicode string?
Python and character normalization

I would like to remove accents, turn all characters to lowercase, and delete any numbers and special characters.

Example:

Frédér8ic@ --> frederic

Proposal:

import unicodedata

def remove_accents(data):
    # Decompose accented characters (NFKD), keep only letters, then lowercase.
    return ''.join(x for x in unicodedata.normalize('NFKD', data)
                   if unicodedata.category(x)[0] == 'L').lower()

Is there any better way to do this?

Community
Fred

2 Answers

15

A possible solution would be

import string, unicodedata

def remove_accents(data):
    return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()

As far as I know, NFKD is the standard way to normalize Unicode into its compatibility-decomposed form, where an accented letter becomes a base letter followed by combining marks. To then remove the digits, the special characters, and the combining marks produced by the normalization, simply compare each character against string.ascii_letters and drop any character not in that set.
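As a rough sketch of the two filtering choices discussed so far (the function names below are made up for illustration and are not from either post), the difference between keeping only ASCII letters and keeping every Unicode letter shows up on characters that NFKD does not decompose, such as ß:

```python
import string
import unicodedata


def strip_to_ascii_letters(text):
    # Hypothetical helper: NFKD splits 'é' into 'e' plus a combining accent,
    # and the membership test then drops the accent, digits, and symbols.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if ch in string.ascii_letters).lower()


def strip_to_unicode_letters(text):
    # Hypothetical helper: same idea, but keeps any character whose Unicode
    # category starts with 'L' (letter), as in the question's proposal.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed
                   if unicodedata.category(ch).startswith('L')).lower()


print(strip_to_ascii_letters('Frédér8ic@'))    # frederic
print(strip_to_unicode_letters('Straße'))      # straße (ß has no decomposition)
print(strip_to_ascii_letters('Straße'))        # strae
```

Which variant is preferable depends on whether non-ASCII letters without a decomposition should survive or be dropped.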

heemayl
Abhijit
  • But what is the `string` variable in that command, where you refer to `if x in string.ascii_letters`? – Falcoa Jan 17 '17 at 12:33
  • @Falcoa is right. There's another solution: `def remove_accents(self, data): return unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')` – lesimoes Apr 19 '17 at 15:32
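For what it's worth, here is a minimal sketch of the encode-based variant from the last comment, assuming Python 3 (the helper names are hypothetical). Two things are worth noting: `encode` returns `bytes`, so a `decode` is needed to get a `str` back, and the ASCII fold alone keeps digits and punctuation, so a letter filter is still required for the question's example:

```python
import unicodedata


def ascii_fold(text):
    # Hypothetical helper: decompose, drop everything that is not ASCII,
    # then decode the resulting bytes back into a str.
    folded = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
    return folded.decode('ascii')


def letters_only_lower(text):
    # Hypothetical helper: additionally drop digits and symbols, then lowercase.
    return ''.join(ch for ch in ascii_fold(text) if ch.isalpha()).lower()


print(ascii_fold('Frédér8ic@'))          # Freder8ic@
print(letters_only_lower('Frédér8ic@'))  # frederic
```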
1

Can you convert the string into HTML entities? If so, you can then use a simple regular expression.

The following replacement would work in PHP/PCRE (see my other answer for an example):

'~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i' => '$1'

Then simply convert back from HTML entities and remove any character outside a-z and A-Z (demo @ CodePad).

Sorry, I don't know Python well enough to provide a Pythonic answer.
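A possible Python rendering of the same idea, offered only as a sketch: it assumes Python 3's standard `html` and `re` modules, reuses the pattern above unchanged, and uses function names invented for this example. Characters are first turned into named entities via `html.entities.codepoint2name`, the accent suffix is stripped with the regex, and the result is unescaped and filtered:

```python
import html
import re
from html.entities import codepoint2name

# Same pattern as the PCRE version above, compiled case-insensitively.
ACCENT_ENTITY = re.compile(
    r'&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);',
    re.IGNORECASE,
)


def to_named_entities(text):
    # Hypothetical helper: replace each character that has a named HTML
    # entity (for example 'é' -> '&eacute;') and leave the rest untouched.
    return ''.join(
        '&%s;' % codepoint2name[ord(ch)] if ord(ch) in codepoint2name else ch
        for ch in text
    )


def remove_accents_via_entities(text):
    # Hypothetical helper: entity-encode, keep only the base letter(s) of
    # accent entities, decode what is left, then drop everything but a-z.
    base = ACCENT_ENTITY.sub(r'\1', to_named_entities(text))
    plain = html.unescape(base)
    return re.sub(r'[^a-z]', '', plain.lower())


print(to_named_entities('Frédér8ic@'))            # Fr&eacute;d&eacute;r8ic@
print(remove_accents_via_entities('Frédér8ic@'))  # frederic
```

This only covers characters that have a named HTML entity, so it is narrower than the `unicodedata` approach.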

Community
Alix Axel
  • I'm not sure that regexes are more efficient than `unicodedata`. – Fred Jan 01 '12 at 19:36
  • @user1125315: I'm not sure either, but it correctly passes your input/output tests. Feel free to try other approaches though, the `unidecode` lib seems awesome. – Alix Axel Jan 01 '12 at 19:41