8

To give an example from Turkish: "şğüı" becomes "sgui".

I'm sure each language has its own conversion rules, and sometimes a single character maps to multiple ASCII characters, such as "alpha" or "phi".

Is there a library or method that performs this conversion?

一二三
Kaan Soral

1 Answer

7

What you are asking for is called transliteration.

Try the Unidecode library.

rodrigo
  • Are there any non-GPL alternatives to Unidecode? – Rjak Jun 08 '17 at 18:46
  • @Rjak: What about this [answer](https://stackoverflow.com/a/1207479/865874) linked above by Martín Muñoz del Río? It uses `unicodedata`, which is part of the Python standard library. – rodrigo Jun 08 '17 at 19:40
  • Hello @rodrigo - the problem with `unicodedata` is that it does replacement, not transliteration. For our application, it would be best if we could find the closest "equivalent" ASCII character (i.e. transliterate). For example, with the Latin name "Piekło", Unidecode would return "Pieklo", which is what we want. The `unicodedata` approach returns "Pieko" (removal) or "Piek?o" (replacement), depending on what you pass for the behavior argument. – Rjak Jun 09 '17 at 20:15
  • @Rjak: Well, the problem is that Unicode does not define `ł` as a composed character, so the decomposition normalization trick does not work... If you have a limited set of characters you want to transliterate (just for Polish names, for example), you can build the table yourself. Other than that and Unidecode, I don't know of any others, sorry. – rodrigo Jun 09 '17 at 20:38
  • No need to be sorry, @rodrigo. I understand the complexities of transliteration, which is why I was looking for a library. Our lawyers will not allow us to use GPL in certain parts of our codebase, so finding a non-GPL library would be awesome. – Rjak Jun 14 '17 at 13:49
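
The two ideas from the comments (Unicode decomposition plus a hand-built table for letters that do not decompose) can be combined using only the standard library. The following is a minimal sketch, not a replacement for Unidecode: the `FALLBACK` table and the `to_ascii` function are hypothetical names made up for illustration, and the table covers only a handful of letters, so you would need to extend it for your own character set.

```python
import unicodedata

# Hypothetical, deliberately tiny table for letters that Unicode does not
# define as "base letter + combining mark". Extend it for your own data.
FALLBACK = {
    "ł": "l", "Ł": "L",   # Polish
    "ı": "i",             # Turkish dotless i
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
    "ß": "ss",            # one character can map to several ASCII characters
    "æ": "ae", "Æ": "AE",
}


def to_ascii(text: str, placeholder: str = "?") -> str:
    """Best-effort ASCII approximation of `text` (illustrative only)."""
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
        elif ch in FALLBACK:
            out.append(FALLBACK[ch])
        else:
            # NFKD splits e.g. "ş" into "s" + a combining cedilla;
            # keep only the ASCII part of the decomposition.
            decomposed = unicodedata.normalize("NFKD", ch)
            base = "".join(c for c in decomposed if c.isascii())
            out.append(base if base else placeholder)
    return "".join(out)


print(to_ascii("şğüı"))    # sgui
print(to_ascii("Piekło"))  # Pieklo
```

Characters that neither decompose nor appear in the table come out as the placeholder, which is exactly the gap a full transliteration library fills.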