
Normalizing the string with unicodedata did not convert the Cyrillic characters. How can I convert Cyrillic characters into the Latin characters they look like?

import unicodedata

cyrillic = 'НОMЕ СHEF'
ordinary = 'HOME CHEF'
print(cyrillic == ordinary)
# prints False; expected True

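# NFKD only decomposes characters; Cyrillic letters have no Latin decomposition
string = unicodedata.normalize('NFKD', cyrillic)
# encoding to ASCII with 'ignore' silently drops every non-ASCII character
string = string.encode('ascii', 'ignore').decode('ascii')
print(string)
# prints M HEF; expected HOME CHEF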
  • Note that `M HEF` comes out because those are the only ASCII characters in the string; the others are Cyrillic and get dropped. If you want a more thoroughly Cyrillic test string, use `'\u041d\u041e\u041c\u0415 \u0421\u041d\u0415F'` (where everything but `F` is Cyrillic). – wjandrea Dec 22 '19 at 19:16
  • Also note that this isn't transliteration or transcription. I'm not sure what exactly it is (maybe "transgraphication"?) but it's like the reverse of [faux-Cyrillic](https://en.wikipedia.org/wiki/Faux_Cyrillic). You might find this useful: https://github.com/airbunny/Faux-cyrillic, though it goes the long way. The short way is with [`str.translate`](https://docs.python.org/3/library/stdtypes.html#str.translate). – wjandrea Dec 22 '19 at 19:22
  • They are called homoglyphs. https://en.wikipedia.org/wiki/Homoglyph – James Dec 22 '19 at 19:29
  • Related: [Translate Unicode to ascii (if possible)](https://stackoverflow.com/q/43367355/4518341) – wjandrea Dec 22 '19 at 19:50
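
The `str.translate` suggestion in the comments above could be sketched roughly as follows. This is only a minimal illustration, not a complete solution: the `CYRILLIC_TO_LATIN` table and the `to_latin_lookalikes` helper are hypothetical names, and the table covers just a handful of uppercase homoglyphs chosen by hand. A real mapping would need to be extended (lowercase letters, other scripts) to suit the data.

```python
# Hypothetical, deliberately partial homoglyph table:
# Cyrillic capitals mapped to the Latin capitals they resemble.
CYRILLIC_TO_LATIN = str.maketrans({
    '\u0410': 'A',  # А
    '\u0412': 'B',  # В
    '\u0415': 'E',  # Е
    '\u041a': 'K',  # К
    '\u041c': 'M',  # М
    '\u041d': 'H',  # Н
    '\u041e': 'O',  # О
    '\u0420': 'P',  # Р
    '\u0421': 'C',  # С
    '\u0422': 'T',  # Т
    '\u0425': 'X',  # Х
})


def to_latin_lookalikes(text):
    """Replace mapped Cyrillic homoglyphs; leave all other characters unchanged."""
    return text.translate(CYRILLIC_TO_LATIN)


# Test string from the comments: everything except the final F is Cyrillic.
cyrillic = '\u041d\u041e\u041c\u0415 \u0421\u041d\u0415F'
print(to_latin_lookalikes(cyrillic) == 'HOME CHEF')  # prints True
```

Characters that are not in the table, including ordinary Latin letters and Cyrillic letters with no Latin look-alike, pass through untouched, so the result may still contain non-ASCII characters.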
