There are many situations where the user's language does not use a Latin script (examples include Greek, Russian and Chinese). In most of these cases, sorting is done by:

  • first, the special characters and numbers (numbers in the local language, though);
  • second, the words in the local script;
  • last, any non-native words, such as "imported" French, English or German words, in general Unicode collation order.

Or, to be more specific about the rest: is it possible to select the sort order based on script?

Example 1: Chinese script first, then Latin, Greek, Arabic (or even more)

Example 2: Greek script first, then Latin, Arabic, Chinese (or even more)

What is the most effective and Pythonic way to create a sort like any of these? (By «any» I mean either the simple «selected script first» with the rest in Unicode order, or the more complicated «selected script first» followed by a specified order for the rest of the scripts.)

ilias iliadis

1 Answer

Interesting question. Here’s some sample code that classifies strings according to the writing system of the first character.

import unicodedata

words = ["Japanese",         # English
         "Nihongo",          # Japanese, rōmaji
         "にほんご",          # Japanese, hiragana
         "ニホンゴ",          # Japanese, katakana
         "日本語",            # Japanese, kanji
         "Японский язык",    # Russian
         "जापानी भाषा"        # Hindi (Devanagari)
]

def wskey(s):
    """Return a sort key that is a tuple (n, s), where n is an int based
    on the writing system of the first character, and s is the passed
    string. Writing systems not addressed (Devanagari, in this example)
    go at the end."""

    sort_order = {
        # We leave gaps to make later insertions easy
        'CJK' : 100,
        'HIRAGANA' : 200,
        'KATAKANA' : 200,  # hiragana and katakana at same level
        'CYRILLIC' : 300,
        'LATIN' : 400
    }

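    # The first word of a character's Unicode name is usually its script,
    # e.g. "CYRILLIC CAPITAL LETTER YA" -> "CYRILLIC". Empty strings and
    # unnamed characters fall through to "UNKNOWN" and sort at the end.
    name = unicodedata.name(s[0], "UNKNOWN") if s else "UNKNOWN"
    first = name.split()[0]
    n = sort_order.get(first, 999999)
    return (n, s)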

words.sort(key=wskey)
for s in words:
    print(s)
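
To pick the script order at call time, as the question asks, the same idea could be wrapped in a small factory. This is only a sketch: make_script_key is a name invented here, and like wskey it guesses the script from the first word of the first character's Unicode name, which is a heuristic and not a real script lookup. Scripts missing from the list share one rank after the listed ones, so they fall back to plain code point order among themselves.

```python
import unicodedata

def make_script_key(order):
    """Hypothetical helper: build a sort key that ranks scripts by `order`.

    order -- script prefixes as they appear in Unicode character names,
             e.g. ["GREEK", "LATIN", "ARABIC", "CJK"].
    Scripts not listed share a single rank after all listed scripts.
    """
    rank = {script: i for i, script in enumerate(order)}
    unlisted = len(order)

    def key(s):
        # Empty strings have no first character; treat them as unlisted.
        name = unicodedata.name(s[0], "UNKNOWN") if s else "UNKNOWN"
        return (rank.get(name.split()[0], unlisted), s)

    return key

sample = ["alpha", "日本語", "αλφα", "Японский", "beta", "βήτα"]

# Greek first, then a fixed order; Cyrillic is unlisted and goes last.
greek_first = sorted(sample, key=make_script_key(["GREEK", "LATIN", "ARABIC", "CJK"]))

# Chinese first, everything else in code point order.
chinese_first = sorted(sample, key=make_script_key(["CJK"]))

print(greek_first)
print(chinese_first)
```

Passing only one script gives the simple «selected script first» variant; passing several gives the fully specified order.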

In wskey, hiragana and katakana (the two Japanese syllabaries) share the same level, so ties between them are broken by ordinary string comparison. Because the hiragana block precedes the katakana block in Unicode, pure-katakana strings will always come after pure-hiragana strings. If we wanted to sort them such that the same syllable (e.g., に and ニ) sorted together, that would be trickier.
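
One possible way to get the same syllable to sort together is to fold katakana to hiragana before comparing. The sketch below relies on the fact that katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana counterparts U+3041..U+3096; fold_kana and kana_key are names made up for this illustration, and the approach ignores the prolonged sound mark (ー) and anything else a full Japanese collation would handle.

```python
# Map each katakana code point to the hiragana code point 0x60 below it.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}

def fold_kana(s):
    """Return s with katakana letters replaced by the matching hiragana."""
    return s.translate(KATAKANA_TO_HIRAGANA)

def kana_key(s):
    # Compare on the folded text first; the original string breaks ties,
    # so a hiragana spelling sorts just before the same word in katakana.
    return (fold_kana(s), s)

kana = ["ニホンゴ", "ねこ", "にほんご", "ネコ"]
kana_sorted = sorted(kana, key=kana_key)
print(kana_sorted)
```

To combine this with the script ranking, the key could return (n, fold_kana(s), s) in place of (n, s).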

Tom Zych