3

Python treats the words МАМА and MAMA as different strings because one is written with Latin letters and the other with Cyrillic letters.

How can I make Python treat them as the same string?

I only care about allomorphs.

Paul R
  • search about `encoding` and `decoding` of strings in python – Moinuddin Quadri Oct 23 '16 at 21:38
  • @jonrsharpe It's a question about characters that look similar but otherwise have nothing in common. It's not about "squeezing" a Unicode string into an ASCII representation. – vpekar Oct 23 '16 at 21:48
  • @vpekar but that's *how* you convince it they're the same string, it won't believe you otherwise – jonrsharpe Oct 23 '16 at 21:51
  • @jonrsharpe the question is about transliteration. A Russian character can conventionally be transliterated into a Latin one, and you can't achieve this using unicodedata.normalize. See updated answer from Brendan Abel. – vpekar Oct 23 '16 at 21:54
  • @vpekar OK, reopened, but did you actually read the first answer to the dupe I suggested? `unidecode` does all that – jonrsharpe Oct 23 '16 at 21:56
  • @jonrsharpe Yes, the first answer to that question is also a solution here. – vpekar Oct 23 '16 at 22:06
  • 1
    Can you clarify if you only care about allomorphs, or does transliteration count? How would you want to deal with "PAPA", for example? – Paul Oct 23 '16 at 22:40

3 Answers

3

There is a Python library called transliterate that converts Cyrillic text to its Latin transliteration:

>>> from transliterate import translit
>>> 
>>> cy = u'\u041c\u0410\u041c\u0410'
>>> en = u'MAMA'
>>> cy == en
False
>>> cy_converted = translit(cy, 'ru', reversed=True)
>>> cy_converted == en
True
>>> cy_converted
u'MAMA'
Brendan Abel
  • 1
    Wouldn't this make "ДРП" be considered equal to "DRP"? It kinda sounds like the OP only wants to consider allomorphs equal. This also seems like "РАРА" would transliterate to "RARA", which I suspect the OP doesn't want. – Paul Oct 23 '16 at 22:38
2

Transliteration is not going to help (it will turn Cyrillic P into Latin R). At first glance, the Unicode compatibility forms (NFKD or NFKC) look hopeful, but they leave U+041C (CYRILLIC CAPITAL LETTER EM) unchanged instead of mapping it to U+004D (LATIN CAPITAL LETTER M), so that won't work.
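The normalization claim can be checked directly. This is a small sketch in Python 3 syntax (the thread's own snippets are Python 2); the variable names are only illustrative.

```python
import unicodedata

cyrillic_em = "\u041c"  # looks like a Latin M, but is a different code point
latin_m = "M"           # U+004D

# None of the four normalization forms changes the Cyrillic letter.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, cyrillic_em)
    print(form, normalized == cyrillic_em, normalized == latin_m)

# The official character names show that the two are unrelated characters.
print(unicodedata.name(cyrillic_em))
print(unicodedata.name(latin_m))
```

Each loop iteration reports that the normalized character is still the Cyrillic one, so normalization alone cannot make the two strings compare equal.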

The only solution is to build your own table of allomorphs, and translate all strings into a canonical form before comparing.
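A minimal sketch of that table-based approach, in Python 3 syntax. The table name, the helper functions `canonical` and `looks_equal`, and the choice of letters are all assumptions made for illustration; the table covers only a handful of capital letters and would need extending (lowercase, other scripts) for real use.

```python
# Hypothetical, deliberately incomplete table: Cyrillic capitals mapped to the
# Latin capitals they visually resemble (not to their transliterations).
CYRILLIC_TO_LATIN_LOOKALIKES = str.maketrans({
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0412": "B",  # CYRILLIC CAPITAL LETTER VE
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u041a": "K",  # CYRILLIC CAPITAL LETTER KA
    "\u041c": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u041d": "H",  # CYRILLIC CAPITAL LETTER EN
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0422": "T",  # CYRILLIC CAPITAL LETTER TE
    "\u0425": "X",  # CYRILLIC CAPITAL LETTER HA
})


def canonical(text):
    """Replace every Cyrillic letter in the table with its Latin look-alike."""
    return text.translate(CYRILLIC_TO_LATIN_LOOKALIKES)


def looks_equal(a, b):
    """Compare two strings after mapping both to the canonical form."""
    return canonical(a) == canonical(b)


print(looks_equal("\u041c\u0410\u041c\u0410", "MAMA"))  # Cyrillic vs. Latin
print(looks_equal("\u0414\u0420\u041f", "DRP"))         # only sounds alike
```

Because the table maps by appearance, Cyrillic "РАРА" becomes Latin "PAPA" rather than "RARA", and letters with no Latin look-alike (such as Д or П) are left untouched, so they never compare equal to a Latin string.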

Note: When I said "Cyrillic P", I cheated and used the Latin allomorph - I don't have an easy way to enter Cyrillic.

0

You might want to use the `normalize` function from `unicodedata`: https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize

dannyxn