Detect same words using different alphabets?

Question

Python treats the words МАМА and MAMA as different because one of them is written in Latin letters and the other in Cyrillic.

How can I make Python treat them as the same string?

I only care about allomorphs.

@jonrsharpe It's a question about characters that look similar but otherwise have nothing in common. It's not about "squeezing" a Unicode string into an ASCII representation. — vpekar, Oct 23 '16 at 21:48
@vpekar but that's *how* you convince it they're the same string, it won't believe you otherwise — jonrsharpe, Oct 23 '16 at 21:51
@jonrsharpe the question is about transliteration. A Russian character can conventionally be transliterated into a Latin one, and you can't achieve this using unicodedata.normalize. See updated answer from Brendan Abel. — vpekar, Oct 23 '16 at 21:54
@vpekar OK, reopened, but did you actually read the first answer to the dupe I suggested? `unidecode` does all that — jonrsharpe, Oct 23 '16 at 21:56
@jonrsharpe Yes, the first answer to that question is also a solution here. — vpekar, Oct 23 '16 at 22:06
Can you clarify if you only care about allomorphs, or does transliteration count? How would you want to deal with "PAPA", for example? — Paul, Oct 23 '16 at 22:40

Brendan Abel · Answer 1 · 2016-10-23T21:53:15.033

3

There is a Python library called transliterate that does Cyrillic-to-Latin transliteration:

>>> from transliterate import translit
>>> 
>>> cy = u'\u041c\u0410\u041c\u0410'
>>> en = u'MAMA'
>>> cy == en
False
>>> cy_converted = translit(cy, 'ru', reversed=True)
>>> cy_converted == en
True
>>> cy_converted
u'MAMA'

edited Oct 23 '16 at 21:53

answered Oct 23 '16 at 21:41

Brendan Abel


1

Wouldn't this make "ДРП" be considered equal to "DRP"? It kinda sounds like the OP only wants to consider allomorphs equal. This also seems like "РАРА" would transliterate to "RARA", which I suspect the OP doesn't want. – Paul Oct 23 '16 at 22:38

score 2 · Accepted Answer · answered Oct 24 '16 at 08:03

Transliteration is not going to help (it will turn Cyrillic P into Latin R). At first glance, the Unicode compatibility forms (NFKD or NFKC) look hopeful, but they leave U+041C (CYRILLIC CAPITAL LETTER EM) unchanged instead of mapping it to U+004D (LATIN CAPITAL LETTER M), so that won't work.
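A quick way to check the normalization claim, as a minimal sketch assuming Python 3 (the variable names are only illustrative):

```python
import unicodedata

cyrillic_em = u'\u041c'  # CYRILLIC CAPITAL LETTER EM
latin_m = u'M'           # U+004D LATIN CAPITAL LETTER M

# Try every normalization form and record what each one produces.
results = {
    form: unicodedata.normalize(form, cyrillic_em)
    for form in ('NFC', 'NFD', 'NFKC', 'NFKD')
}

for form, normalized in results.items():
    # Show whether the result is still the Cyrillic letter,
    # and whether it became the Latin one.
    print(form, normalized == cyrillic_em, normalized == latin_m)
```

None of the forms changes the Cyrillic letter, because U+041C has no decomposition mapping to a Latin character.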

The only solution is to build your own table of allomorphs, and translate all strings into a canonical form before comparing.

Note: When I said "Cyrillic P", I cheated and used the Latin allomorph - I don't have an easy way to enter Cyrillic.
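The table-based approach could look something like the sketch below. The table `CYRILLIC_TO_LATIN` and the helpers `canonical` and `looks_same` are hypothetical names, and the table is deliberately incomplete: it covers only a few Cyrillic capitals, so lowercase letters and other scripts would need their own entries.

```python
# Hypothetical, incomplete table: Cyrillic capitals mapped to the
# Latin capitals they visually resemble. Extend as needed.
CYRILLIC_TO_LATIN = {
    u'\u0410': u'A',  # CYRILLIC CAPITAL LETTER A
    u'\u0412': u'B',  # CYRILLIC CAPITAL LETTER VE
    u'\u0415': u'E',  # CYRILLIC CAPITAL LETTER IE
    u'\u041a': u'K',  # CYRILLIC CAPITAL LETTER KA
    u'\u041c': u'M',  # CYRILLIC CAPITAL LETTER EM
    u'\u041d': u'H',  # CYRILLIC CAPITAL LETTER EN
    u'\u041e': u'O',  # CYRILLIC CAPITAL LETTER O
    u'\u0420': u'P',  # CYRILLIC CAPITAL LETTER ER
    u'\u0421': u'C',  # CYRILLIC CAPITAL LETTER ES
    u'\u0422': u'T',  # CYRILLIC CAPITAL LETTER TE
    u'\u0425': u'X',  # CYRILLIC CAPITAL LETTER HA
}

# str.translate expects a mapping keyed by code point.
_TRANSLATION = {ord(src): dst for src, dst in CYRILLIC_TO_LATIN.items()}


def canonical(text):
    """Replace every look-alike character with its Latin counterpart."""
    return text.translate(_TRANSLATION)


def looks_same(a, b):
    """Compare two strings after mapping both to the canonical form."""
    return canonical(a) == canonical(b)


print(looks_same(u'\u041c\u0410\u041c\u0410', u'MAMA'))  # Cyrillic MAMA vs Latin MAMA
print(looks_same(u'\u0420\u0410\u0420\u0410', u'PAPA'))  # Cyrillic look-alike of PAPA
print(looks_same(u'\u0414\u0420\u041f', u'DRP'))         # not look-alikes
```

Characters that are not in the table pass through unchanged, so a Cyrillic word that only transliterates to a Latin one (rather than looking like it) still compares as different.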

OK. Thanks. I think it is the only way. – Paul R Oct 24 '16 at 08:37

score 0 · Answer 3 · answered Oct 23 '16 at 21:40


You might want to use the unicodedata.normalize method: https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize

answered Oct 23 '16 at 21:40

dannyxn

