Sort Cyrillic strings before Latin in Python

Question

In my database, I have records in both Cyrillic and Latin characters. By default, they are listed alphabetically with Latin records first:

abc... bcd... cde... абв...

I would like the Cyrillic records to come first:

абв... abc... bcd... cde...

What I have tried so far:

This solution. It is not so great because it only sorts by the first word, and I can have both Cyrillic and Latin words in the same string (or even mixed characters in the same word).
Writing my own lists of the Cyrillic and Latin alphabets. It works, but it does not scale: I cannot enumerate every possible letter of the two scripts, including those with diacritics.

I have also been looking into PyICU but don't see how I can put it to use.

My guess is that I should use some custom collation here. The question is how this can be done in practice.

So, what is the exact order you want to achieve? "I would like to put the Cyrillic to the first place" - should any string which contains Cyrillic character come before all-Latin strings? — Zaur Nasibov, Mar 05 '20 at 07:52
@ZaurNasibov Not really. I want every Cyrillic character to come before any Latin character. For example, 'waюnt' (with one Cyrillic character) should come before 'wannt': the third letter differs in the two words, and the Cyrillic one comes before the Latin one. But if we have 'wqюнт' and 'wannt', then 'wannt' should come first, because the words already differ at the second letter, and 'a' precedes 'q' in the Latin alphabet. — Edmond, Mar 05 '20 at 07:59
You could use a `key` function in `sorted` or `list.sort` that prepends e.g. a `0` if the first character of the string is Cyrillic and a `1` if not. — Panagiotis Kanavos, Mar 05 '20 at 08:05
@PanagiotisKanavos I need to take into account every character, not the first one only. — Edmond, Mar 05 '20 at 08:06
@Edmond what do you expect in case of 'zжz', 'zzв' ? And also in 'zвz', 'zzж'? — Alex Sveshnikov, Mar 05 '20 at 08:10
Then the `key` lambda will have some extra work to do. Worst case, you could prepend every character. The key point is that sorting depends entirely on the `key` lambda. The key result doesn't have to look anything like the input as long as it produces the desired sort order. [This SO question](https://stackoverflow.com/questions/243831/unicode-block-of-a-character-in-python) shows how to detect the Unicode block of a character. Or you could just check the numeric value [against the block's range](https://en.wikipedia.org/wiki/Cyrillic_(Unicode_block)) — Panagiotis Kanavos, Mar 05 '20 at 08:12
@Alex We compare every letter in an iteration. So, 1st letter: 'z' and 'z': no difference; second letter: 'ж' and 'z': we have a Cyrillic and a Latin letter, so the Cyrillic precedes. We finish our iteration, and the sorting is: zжz... zzв. — Edmond, Mar 05 '20 at 08:13
@snakecharmerb PostgreSQL. But the sorting actually occurs with the strings after they have been retrieved from the database. — Edmond, Mar 05 '20 at 08:21
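
The rule worked out in these comments (compare position by position; at the first difference a Cyrillic character beats a Latin one, otherwise the usual order applies) can be sketched as an explicit comparison function. This is only an illustration: `compare` and `is_cyrillic` are hypothetical helper names, "Cyrillic" is assumed to mean "the Unicode character name starts with CYRILLIC", and characters of the same script are compared by plain code point, which is not a real alphabetical collation.

```python
import unicodedata
from functools import cmp_to_key


def is_cyrillic(ch):
    # Assumption: a character counts as Cyrillic if its Unicode name says so.
    # The '' default avoids a ValueError for unnamed characters.
    return unicodedata.name(ch, "").startswith("CYRILLIC")


def compare(a, b):
    # Walk both strings in parallel and decide at the first differing position.
    for x, y in zip(a, b):
        if x == y:
            continue
        x_cyr, y_cyr = is_cyrillic(x), is_cyrillic(y)
        if x_cyr != y_cyr:
            return -1 if x_cyr else 1  # Cyrillic precedes everything else
        return -1 if x < y else 1      # same group: plain code point order
    # One string is a prefix of the other: the shorter one sorts first.
    return len(a) - len(b)


words = ["wannt", "wqюнт", "waюnt", "zzв", "zжz"]
result = sorted(words, key=cmp_to_key(compare))
print(result)  # ['waюnt', 'wannt', 'wqюнт', 'zжz', 'zzв']
```

A comparison function mirrors the description most directly, but a `key` function that encodes the same rule is usually faster, because each key is computed once per string rather than once per comparison.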

AKX · Answer 1 · 2020-03-05T08:04:46.573

2

One way to do this is to use the transliterate module, or maybe cyrtranslit, with a sort key that transliterates everything to the desired alphabet:

import transliterate

items = ['abc', 'bcd', 'cde', 'абв']

print(sorted(items, key=lambda x: transliterate.translit(x, 'ru')))

The output is in the desired order:

['абв', 'abc', 'bcd', 'cde']

edited Mar 05 '20 at 08:04

answered Mar 05 '20 at 07:58

The Russian alphabet is not what I want. I need the Cyrillic alphabet, not Russian. To put it more clearly, all characters of Cyrillic-based languages. – Edmond Mar 05 '20 at 08:00
Nice use of transliterate module, had no idea about it :) – Zaur Nasibov Mar 05 '20 at 08:01
It only works because all Latin strings in the example were transliterated to Russian strings which sort higher than 'абв'. If the first string in the example is 'aba', not 'abc', then it doesn't work. It will also completely mess up the sorting of pure Latin strings; for example, sorting ['ccc', 'ddd'] with this key will give you ['ddd', 'ccc'] – Alex Sveshnikov Mar 05 '20 at 08:02
@Edmond I added a reference to `cyrtranslit` too. It contains mapping tables you might be able to use. https://github.com/opendatakosovo/cyrillic-transliteration/blob/master/cyrtranslit/mapping.py – AKX Mar 05 '20 at 08:05
OK, thanks for your suggestion and correction, but this definitely does not fit what I am asking for. – Edmond Mar 05 '20 at 08:06

score 1 · Answer 2 · answered Mar 05 '20 at 08:21

IMO this is not a trivial thing. I'd say that a collation is indeed required.

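So, say a key function converts a string to a tuple of code points, where all non-Cyrillic code points are shifted past the end of the Unicode range so that they sort after every Cyrillic one:

import unicodedata

# One past the largest Unicode code point, so every shifted value
# sorts after every unshifted (Cyrillic) one.
SHIFT = 0x110000

def is_cyrillic(c):
    # The '' default avoids a ValueError for unnamed characters such as '\n'.
    return unicodedata.name(c, '').startswith('CYRILLIC')

def key(s):
    return tuple(
        ord(c) if is_cyrillic(c) else ord(c) + SHIFT
        for c in s
    )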


>>> sorted(('wannt', 'waюnnt'), key=key)
['waюnnt', 'wannt']

is_cyrillic can be optimized by using a precomputed lookup table or by caching the Cyrillic characters that occur in the database strings.

Zaur Nasibov

An optimization could be to check the character's codepoint against the Cyrillic range (U+0400-U+04FF). – Panagiotis Kanavos Mar 05 '20 at 08:35
@PanagiotisKanavos, good point! This would be much faster. For full range, all the extensions and supplements have to be checked too. – Zaur Nasibov Mar 05 '20 at 08:45
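
A minimal sketch of the range check suggested in these comments, covering the extension and supplement blocks as well. The block boundaries are taken from the Unicode block list; the names `CYRILLIC_BLOCKS` and `cyrillic_first_key` are hypothetical, and the few Cyrillic characters that live outside these blocks (for example in Phonetic Extensions) are deliberately not handled.

```python
# Cyrillic blocks as (first, last) code points.
CYRILLIC_BLOCKS = (
    (0x0400, 0x04FF),    # Cyrillic
    (0x0500, 0x052F),    # Cyrillic Supplement
    (0x1C80, 0x1C8F),    # Cyrillic Extended-C
    (0x2DE0, 0x2DFF),    # Cyrillic Extended-A
    (0xA640, 0xA69F),    # Cyrillic Extended-B
    (0x1E030, 0x1E08F),  # Cyrillic Extended-D
)

# One past the largest Unicode code point, so shifted values never
# collide with or precede an unshifted Cyrillic code point.
SHIFT = 0x110000


def is_cyrillic(ch):
    cp = ord(ch)
    return any(first <= cp <= last for first, last in CYRILLIC_BLOCKS)


def cyrillic_first_key(s):
    # Cyrillic characters keep their code point; everything else is shifted up.
    return tuple(ord(c) if is_cyrillic(c) else ord(c) + SHIFT for c in s)


print(sorted(['abc', 'bcd', 'cde', 'абв'], key=cyrillic_first_key))
# ['абв', 'abc', 'bcd', 'cde']
```

Within each script the order is still plain code point order, so, for instance, 'ё' (U+0451) sorts after 'я' (U+044F). A language-aware order needs a real collation on top of this.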

score 0 · Answer 3 · answered Mar 05 '20 at 08:29

You can try to generate a key by prepending every character with 1 if it's a Latin character and with 0 otherwise:

sorted(items, key=lambda item: ['1' + x if x < '\x7f' else '0' + x for x in item])

Alex Sveshnikov

`á` is a Latin letter too. Test for Cyrillic instead, as that is contained in just a few Unicode subranges. Nice idea, still. – Jongware Mar 05 '20 at 08:34
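
One way to act on this comment, as a sketch: keep the prefixing idea from the answer, but decide the prefix by testing for Cyrillic rather than for ASCII. The function name `cyrillic_first` is hypothetical, and the test assumes that a Cyrillic character is one whose Unicode name starts with CYRILLIC.

```python
import unicodedata


def cyrillic_first(item):
    # Prefix every character with '0' if it is Cyrillic and '1' otherwise,
    # so that 'á' and other non-ASCII Latin letters stay with the Latin group.
    return [
        ('0' if unicodedata.name(ch, '').startswith('CYRILLIC') else '1') + ch
        for ch in item
    ]


items = ['\u00e1bc', 'abc', 'абв', 'bcd']  # '\u00e1' is 'á'
result = sorted(items, key=cyrillic_first)
print(result)  # ['абв', 'abc', 'bcd', 'ábc']
```

With the ASCII test from the answer, 'ábc' would be grouped with the Cyrillic strings. Here it stays in the Latin group, although it still lands after 'bcd' because characters inside a group are compared by code point.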

Malvina Pushkova · Answer 4 · 2021-10-29T20:09:11.023

A very dirty but working solution that sorts according to the first character. It also ignores letter case.

ls = ['32', '24', 'xyz', 'WYZ', 'abc', 'абв', 'КЛМ', 'эюя', 'еёж', 'ёжз', '', '_']

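def sort_rule(st):
    st = st.lower()
    ch = st[0] if st else ''
    if 'а' <= ch <= 'я':
        # Russian letters а-я go first
        st = '1' + st
    elif ch == 'ё':
        # ё (U+0451) lies outside а-я; file it right after е
        st = '1е' + st
    elif 'a' <= ch <= 'z':
        # basic Latin letters go second
        st = '2' + st
    else:
        # digits, punctuation and the empty string go last
        st = '3' + st
    return st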

sorted(ls, key=sort_rule)

> ['абв', 'еёж', 'ёжз', 'КЛМ', 'эюя', 'abc', 'WYZ', 'xyz', '', '24', '32', '_']

For comparison, default sorting gives the following result:

sorted(ls)

> ['', '24', '32', 'WYZ', '_', 'abc', 'xyz', 'КЛМ', 'абв', 'еёж', 'эюя', 'ёжз']

The question is about Cyrillic, not Russian. This answer covers the Russian language only. — Edmond, Oct 10 '21 at 14:26

Andj · Answer 5 · 2023-03-19T12:37:37.713

This is an old question with four answers, but I feel the following answer is more appropriate.

If you have both icu4c and PyICU installed, there is a fairly simple solution. ICU uses the CLDR Collation Algorithm, a tailoring of the Unicode Collation Algorithm.

Since you want a language-insensitive sort with all Cyrillic characters sorted first, the simplest approach is to tailor the CLDR root collation. All tailorings of ICU collations are tailorings of the root collation, so you only provide the minimum changes required.

In this scenario, you just need a reorder directive:

from icu import RuleBasedCollator
items = ['cde', 'abc', 'bcd', 'абв']
rules = "[reorder Cyrl]"
collator = RuleBasedCollator(rules)
sorted(items, key=collator.getSortKey)
# ['абв', 'abc', 'bcd', 'cde']

But the above rules do exactly what the Russian collation rules do, reordering Cyrillic based on the root collation.

from icu import Locale, Collator
items = ['cde', 'abc', 'bcd', 'абв']
collator = Collator.createInstance(Locale("ru"))
print(sorted(items, key=collator.getSortKey))