Python treats the words МАМА
and MAMA
differently because one is written in Latin script and the other in Cyrillic.
How can I make Python treat them as the same string?
I only care about allomorphs (letters that look identical in both scripts).
There is a Python library called transliterate that converts Cyrillic text to Latin:
>>> from transliterate import translit
>>>
>>> cy = u'\u041c\u0410\u041c\u0410'
>>> en = u'MAMA'
>>> cy == en
False
>>> cy_converted = translit(cy, 'ru', reversed=True)
>>> cy_converted == en
True
>>> cy_converted
u'MAMA'
Transliteration is not going to help (it will turn Cyrillic P into Latin R). At first glance, the Unicode compatibility forms (NFKD or NFKC) look hopeful, but they leave U+041C (CYRILLIC CAPITAL LETTER EM) as U+041C rather than turning it into U+004D (LATIN CAPITAL LETTER M), so that won't work.
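The normalization claim is easy to check. A small sketch using only the standard unicodedata module (the variable names are illustrative):

```python
import unicodedata

cy_em = u'\u041c'  # CYRILLIC CAPITAL LETTER EM
la_em = u'M'       # LATIN CAPITAL LETTER M

# Apply every normalization form to the Cyrillic letter.
results = {form: unicodedata.normalize(form, cy_em)
           for form in ('NFC', 'NFD', 'NFKC', 'NFKD')}

# The letter has no decomposition, so each form returns it unchanged
# and it never becomes the Latin letter.
unchanged = all(value == cy_em for value in results.values())
becomes_latin = any(value == la_em for value in results.values())

print(unchanged, becomes_latin)
print(unicodedata.name(cy_em))
print(unicodedata.name(la_em))
```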
The only solution is to build your own table of allomorphs, and translate all strings into a canonical form before comparing.
Note: When I said "Cyrillic P", I cheated and used the Latin allomorph - I don't have an easy way to enter Cyrillic.
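The table approach described above could look roughly like this. It is only a sketch: the LOOKALIKES table and the canonical() helper are made-up names, the table covers just a handful of capital letters, and the choice of Latin as the canonical form is an assumption. A real table would need lowercase letters and whatever other pairs matter for your data.

```python
# Hypothetical, deliberately incomplete table: Cyrillic capitals that look
# like Latin capitals, mapped to the Latin letter used as the canonical form.
LOOKALIKES = {
    u'\u0410': u'A',  # CYRILLIC CAPITAL LETTER A
    u'\u0412': u'B',  # CYRILLIC CAPITAL LETTER VE
    u'\u0415': u'E',  # CYRILLIC CAPITAL LETTER IE
    u'\u041a': u'K',  # CYRILLIC CAPITAL LETTER KA
    u'\u041c': u'M',  # CYRILLIC CAPITAL LETTER EM
    u'\u041d': u'H',  # CYRILLIC CAPITAL LETTER EN
    u'\u041e': u'O',  # CYRILLIC CAPITAL LETTER O
    u'\u0420': u'P',  # CYRILLIC CAPITAL LETTER ER
    u'\u0421': u'C',  # CYRILLIC CAPITAL LETTER ES
    u'\u0422': u'T',  # CYRILLIC CAPITAL LETTER TE
    u'\u0425': u'X',  # CYRILLIC CAPITAL LETTER HA
}

# str.translate expects a mapping keyed by code point (int), not by character.
TABLE = {ord(cyrillic): latin for cyrillic, latin in LOOKALIKES.items()}


def canonical(text):
    """Replace every Cyrillic look-alike in text with its Latin twin."""
    return text.translate(TABLE)


cy = u'\u041c\u0410\u041c\u0410'
en = u'MAMA'
print(cy == en)
print(canonical(cy) == canonical(en))
```

Unlike transliteration, this maps Cyrillic ER (U+0420) to Latin P, because the table matches letters by shape rather than by sound. Characters that are not in the table pass through unchanged.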
You might want to use the unicodedata.normalize function: https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize