How to sort a list of strings containing numbers by alphabetical order and numeric order?

Question

I have a list of strings containing numbers. The text is in Thai, as shown below.

mylist = ['เชียงใหม่_10_เขต', 'เชียงใหม่_1_เขต', 'เชียงใหม่_2_เขต', 'พะเยา', 'ภูเก็ต', 'กรุงเทพ']

When I sort the list with a collation key using this code...

import pyuca
sort_key = sorted(mylist, key=pyuca.Collator().sort_key)

the characters are sorted correctly, but strings that share the same text and differ only in the number are not sorted numerically, as in the output below.

['กรุงเทพ', 'เชียงใหม่_1_เขต', 'เชียงใหม่_10_เขต', 'เชียงใหม่_2_เขต', 'พะเยา', 'ภูเก็ต']

The output I want is this:

['กรุงเทพ', 'เชียงใหม่_1_เขต', 'เชียงใหม่_2_เขต', 'เชียงใหม่_10_เขต', 'พะเยา', 'ภูเก็ต']

Is there any way to do that?

Does this answer your question? [Is there a built in function for string natural sort?](https://stackoverflow.com/questions/4836710/is-there-a-built-in-function-for-string-natural-sort) – tevemadar, Oct 22 '20 at 08:28
@tevemadar Thank you for the response. I tried `natsorted(sort_key)`; the numbers are sorted correctly, but it changes the alphabetical order and gets it wrong. This is the output: `['กรุงเทพ', 'พะเยา', 'ภูเก็ต', 'เชียงใหม่เขต1', 'เชียงใหม่เขต2', 'เชียงใหม่เขต10']` – Kaow, Oct 22 '20 at 08:39
Could you create an English example with the same problem, so that it is easier to help you? – Gulzar, Oct 22 '20 at 08:48
To my understanding `natsorted` is a drop-in replacement for `sorted`. Try `sort_key = natsorted(mylist, key=pyuca.Collator().sort_key)` – tevemadar, Oct 22 '20 at 09:08
@tevemadar I tried that, but the numeric order is still wrong. This is the result: `['กรุงเทพ', 'เชียงใหม่เขต1', 'เชียงใหม่เขต10', 'เชียงใหม่เขต2', 'พะเยา', 'ภูเก็ต']` – Kaow, Oct 22 '20 at 09:17

yatu · Accepted Answer · 2020-10-22T09:25:26.823

You will need to extract the digits from the string and cast them to int; otherwise the numbers are compared lexicographically, as text. You could use a regex to split the string into its alphabetic and numeric parts, and sort on a tuple of (collation key of the text, number):

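import pyuca
import re

def sorter(s, c):
    # Default to 0 so a string without a number sorts before numbered variants of the same text
    dig = 0
    # The capturing group keeps the digit runs in the split result
    parts = re.split(r'(\d+)', s)
    alpha = []
    for part in parts:
        try:
            # Digit runs become the numeric part (the last run wins if there are several)
            dig = int(part)
        except ValueError:
            # Everything else is text to be collated
            alpha.append(part)
    # Compare the collated text first, then the number
    return c.sort_key(''.join(alpha)), dig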

Now if we sort using the above transformation function:

c = pyuca.Collator()
sorted(mylist, key=lambda s: sorter(s, c))

['กรุงเทพ',
 'เชียงใหม่_1_เขต',
 'เชียงใหม่_2_เขต',
 'เชียงใหม่_10_เขต',
 'พะเยา',
 'ภูเก็ต']
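
Note that `sorter` keeps only the last number in each string and joins all the text into one piece. If a string can contain several numbers, a chunk-by-chunk key may be a better fit. The following is only a sketch of that idea, not part of the original answer: the names `natural_key` and `text_key` are hypothetical, and the default `text_key=str` compares text by code point, which is not correct for Thai.

```python
import re

_DIGITS = re.compile(r'(\d+)')

def natural_key(s, text_key=str):
    """Return a tuple key: text chunks go through text_key, digit runs become ints."""
    # Splitting on a capturing group alternates text, digits, text, ...
    # so even positions are text (possibly '') and odd positions are digit runs.
    parts = _DIGITS.split(s)
    return tuple(
        int(part) if i % 2 else text_key(part)
        for i, part in enumerate(parts)
    )

# ASCII demonstration with the default code-point comparison for text
names = ['file_10_b', 'file_2_b', 'file_2_a', 'x1y10', 'x1y2']
print(sorted(names, key=natural_key))
# ['file_2_a', 'file_2_b', 'file_10_b', 'x1y2', 'x1y10']
```

To collate Thai text as in the answer above, one could pass the collator's key function instead of the default, for example `sorted(mylist, key=lambda s: natural_key(s, c.sort_key))` with `c = pyuca.Collator()`. Because text and numbers always occupy the same positions in the tuple, the comparison never mixes the two types.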

edited Oct 22 '20 at 09:25

answered Oct 22 '20 at 08:49

yatu


Thank you very much, this is very helpful. But what if the number is not always at the end of the string, as in `เชียงใหม่_1_เขต`? Is there any way to handle that? – Kaow Oct 22 '20 at 09:03
Can you give some examples of the behaviour you expect in such a case? Should `_เขต` account for the ordering too? @Kaow – yatu Oct 22 '20 at 09:04
For example, I have this list `['เชียงใหม่_10_เขต', 'เชียงใหม่_1_เขต', 'เชียงใหม่_2_เขต', 'พะเยา', 'ภูเก็ต', 'กรุงเทพ']` and I want to sort it like this `['กรุงเทพ', 'เชียงใหม่_1_เขต', 'เชียงใหม่_2_เขต', 'เชียงใหม่_10_เขต', 'พะเยา', 'ภูเก็ต']`. What I want is: if the other characters in the strings are the same but the number differs (anywhere in the string), the strings should be ordered by that number. – Kaow Oct 22 '20 at 09:12
Updated, should work now @Kaow Can you update the question too? Thx – yatu Oct 22 '20 at 09:25
Thank you very much, this helps me a lot :D – Kaow Oct 22 '20 at 09:34
