
Possible Duplicate:
What is the best way to remove accents in a python unicode string?

In a certain system, I need to generate usernames that may contain only plain ASCII characters (a-z, 0-9, dashes). Many users, however, have names that don't fit those restrictions, for example the German names "Müller" or "Röthlin".

Those umlauts have an alternative spelling (I'm sure there's a name for it, but I don't know it; knowing it might help with Googling).

A naive approach would be to employ a translation table:

```python
name = name.replace('Ä', 'Ae')
name = name.replace('ä', 'ae')
name = name.replace('ö', 'oe')
```

and so forth.
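For illustration, the same idea can be written as a single table with `str.maketrans` and `str.translate` instead of chained `replace` calls. This is only a sketch: the table name `GERMAN_TABLE` and the function `transliterate_german` are made up here, and the table covers just the German characters mentioned in this thread.

```python
# Hypothetical table: only the German umlauts and ß, nothing else.
GERMAN_TABLE = str.maketrans({
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ä": "ae", "ö": "oe", "ü": "ue",
    "ß": "ss",
})


def transliterate_german(name: str) -> str:
    """Replace German umlauts and ß; leave every other character untouched."""
    return name.translate(GERMAN_TABLE)


print(transliterate_german("Müller"))
print(transliterate_german("Röthlin"))
```

One pass over the string handles every entry, but the table still has to list each character by hand.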

This approach, however, fails as soon as you have users from cultures other than, say, German, where other characters might appear. So I'm looking for a generic way to "convert" as many non-ASCII characters as possible before falling back to simply stripping them out.
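To make the "convert first, strip last" idea concrete, here is a rough sketch using only the standard library's `unicodedata` module. It assumes three stages: a small language-specific table (the hypothetical `SPECIAL` dict, German only), then NFKD decomposition to peel accents off base letters, then dropping whatever is still not ASCII. The function name `to_ascii_username` and the choice to lowercase the result are assumptions for this example, not part of the stated requirements.

```python
import re
import unicodedata

# Hypothetical language-specific overrides, applied before generic stripping.
SPECIAL = {
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ä": "ae", "ö": "oe", "ü": "ue",
    "ß": "ss",
}


def to_ascii_username(name: str) -> str:
    # Compose first so that "u" + combining diaeresis matches the "ü" key.
    name = unicodedata.normalize("NFC", name)
    name = "".join(SPECIAL.get(ch, ch) for ch in name)

    # Decompose accented letters into base letter + combining mark,
    # then drop the combining marks (é becomes e, Å becomes A).
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Fallback: discard anything that is still outside ASCII.
    ascii_only = stripped.encode("ascii", "ignore").decode("ascii")

    # Assumption: usernames are lowercase; keep only a-z, 0-9 and dashes.
    return re.sub(r"[^a-z0-9-]", "", ascii_only.lower())


for sample in ("Müller", "José", "Åse-Marie", "Łukasz"):
    print(sample, "->", to_ascii_username(sample))
```

Letters without a Unicode decomposition, such as "Ł", fall through to the stripping step and are lost. That is the gap a dedicated transliteration table (or a package like the Unidecode module mentioned in the comments) is meant to fill.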

Dave Vogt
  • see [unidecode module](http://pypi.python.org/pypi/Unidecode/) – jfs Aug 15 '12 at 20:26
  • but you would need a table anyway to establish which sign corresponds to which? – Davoud Taghawi-Nejad Aug 15 '12 at 20:30
  • Similar questions have been asked before, see http://stackoverflow.com/questions/517923, here http://stackoverflow.com/questions/8694815 and here http://stackoverflow.com/questions/4162603/. – raju-bitter Aug 15 '12 at 20:33
  • Do umlauts *have* an ASCII equivalent? ASCII's only like 128 characters. – Waleed Khan Aug 15 '12 at 20:38
  • @arxanas At least for the German ones (äöüÄÖÜ and ß), there are standard (as in, official) ways to approximate them using the plain latin alphabet, explicitly intended for cases when the correct letters cannot be used. I imagine there are at least some conventions along those lines in other languages as well. –  Aug 15 '12 at 20:52
  • @jfs `unidecode` does not work for this German use-case. See [FAQ](https://pypi.org/project/Unidecode/): _"German umlauts are transliterated incorrectly"_ – malfroid Apr 15 '22 at 08:49
  • @malfroid the question is explicitly about non-german-specific generic transliteration which unidecode supports by design – jfs Apr 15 '22 at 17:35
