I have a list of sublists of arbitrary lengths, populated with strings, both alpha and numeric. I want to sort this list first by the first element of each sub-list, then by the second element, and so on. Numeric strings should be treated as integers or floats, so that for example '100' is placed after '12', not before.

Here’s an example input list:

[['10001', '1002', '501'],
 ['10001', '1002', '5001'],
 ['1001', '1002', '5'],
 ['1', '1002', '5'],
 ['1', '102', '6'],
 ['1', '12', '4'],
 ['10', '11', '3'],
 ['mihail', '1', '2'],
 ['1', 'mihail', '1']]

The desired output after sorting is:

[['1', '12', '4'],
 ['1', '102', '6'],
 ['1', '1002', '5'],
 ['1', 'mihail', '1'],
 ['10', '11', '3'],
 ['1001', '1002', '5'],
 ['10001', '1002', '501'],
 ['10001', '1002', '5001'],
 ['mihail', '1', '2']]

Here is the approach I have tried:

def true_numeric(string_):
    allowed = '0123456789.'
    for char in string_:
        if char not in allowed:
            return False
    point_count = sum([1 for char in string_ if char=='.'])
    if point_count > 1:
        return False
    return True


def numeric_strings_sort_key(item):
    result = []
    for element in item:
        if true_numeric(element):
            result.append((0, float(element)))
        else:
            result.append((1, element))
    return tuple(result)


my_list = [
    ['10001', '1002', '501'],
    ['10001', '1002', '5001'],
    ['1001', '1002', '5'],
    ['1', '1002', '5'],
    ['1', '102', '6'],
    ['1', '12', '4'],
    ['10', '11', '3'],
    ['mihail', '1', '2'],
    ['1', 'mihail', '1']
]

my_list.sort(key=numeric_strings_sort_key)

print(my_list)

This code uses a custom true_numeric function to determine whether a string represents a valid floating-point number and treats it as such when sorting. The numeric_strings_sort_key function uses this function to construct a sorting key tuple that correctly handles mixed data types.
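
One caveat with the character-based check, shown as a small sketch rather than a definitive analysis: it does not always agree with what `float()` accepts. The helper `is_float` below is a hypothetical name introduced only for this comparison, and `true_numeric` is a condensed version of the check described above.

```python
def true_numeric(string_):
    # Same rule as the check above: only digits and at most one dot.
    allowed = '0123456789.'
    if any(char not in allowed for char in string_):
        return False
    return string_.count('.') <= 1


def is_float(string_):
    # Hypothetical helper: ask float() directly whether it can parse the string.
    try:
        float(string_)
        return True
    except ValueError:
        return False


# Print both verdicts side by side for a few edge cases.
for sample in ['', '.', '-5', '1e3', '12.5']:
    print(repr(sample), true_numeric(sample), is_float(sample))
```

The empty string and a lone `'.'` pass the character check although `float()` rejects them, so the sort key would raise `ValueError` on such input, while `'-5'` and `'1e3'` are rejected by the check even though `float()` parses them.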

However, I’m not sure if this is the best approach or if there’s a simpler or more efficient way to achieve the desired result. Is there a better way to sort a list of lists with mixed data types and an arbitrary number of levels in Python?

  • Doesn't `['1', '12', '4']` in your desired output contradict `Numeric strings should be treated as integers or floats` ??? Or do you want to sort only the top_level list? And what about the `arbitrary number of levels` ? Do you mean there can be a list of lists of lists... ? – Swifty Jun 16 '23 at 10:59
  • if sorted as strings, a hundred '100' goes before twelve '12' – Anton Bibin Jun 16 '23 at 11:02
  • arbitrary number of levels implies that it could be 2d array of any width. E.g. my_list = [['1', '1002'], ['1', '102'], ['1', '12'], ['10', '11'], ['mihail', '1'], ['1', 'mihail']] should be processed as well. There are two levels of sorting. In the example above there is 3. – Anton Bibin Jun 16 '23 at 11:04
  • Ok, then depth probably isn't the appropriate word (it suggests embedding depth). – Swifty Jun 16 '23 at 11:05
  • I especially doubt that my true_numeric() is the way to go. But `float('1,234.56')` and `float('12a3.45')` both raise `ValueError: could not convert string to float`, and `int('123.45')`, `int('1,234')` and `int('12a3')` all raise `ValueError: invalid literal for int() with base 10`. – Anton Bibin Jun 16 '23 at 11:07
  • Though there are other methods, you could probably get rid of true_numeric and do `try: element = float(element) ; except: pass` – Swifty Jun 16 '23 at 11:10
  • Thanks! Great idea, shorter and more appropriate. – Anton Bibin Jun 16 '23 at 11:13

2 Answers

Writing your own key function

Since all the elements are strings, you need to tell list.sort or sorted explicitly that strings should be interpreted as integers when possible.

This can be done using the key parameter of list.sort and sorted:

def int_or_not(x):
    try:
        return ('int', int(x))
    except ValueError:
        return (type(x).__name__, x)

def k(l):
    return [int_or_not(x) for x in l]

ll = [['10001', '1002', '501'],
 ['10001', '1002', '5001'],
 ['1001', '1002', '5'],
 ['1', '1002', '5'],
 ['1', '102', '6'],
 ['1', '12', '4'],
 ['10', '11', '3'],
 ['mihail', '1', '2'],
 ['1', 'mihail', '1']]

ll.sort(key=k)

print(*ll, sep='\n')
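
The key above only recognises integers, while the question also mentions floats. A minimal sketch of one way to cover both, assuming every element is a string: try `int` first, then `float`, and fall back to the text. The names `number_or_text` and `row_key`, and the sample rows, are hypothetical and used only for illustration.

```python
def number_or_text(x):
    # Hypothetical helper: try int first, then float; otherwise keep the string.
    # The leading 0/1 makes every number sort before every non-numeric string.
    for convert in (int, float):
        try:
            return (0, convert(x))
        except ValueError:
            pass
    return (1, x)


def row_key(row):
    # One (rank, value) pair per element, compared position by position.
    return [number_or_text(x) for x in row]


rows = [['2.5', 'b'], ['10', 'a'], ['2', 'mihail'], ['mihail', '1'], ['2', '7']]
rows.sort(key=row_key)
print(*rows, sep='\n')
```

Since Python compares `int` and `float` values with each other directly, the two kinds of numbers can share rank 0 without further conversion.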

Using the natsort library

You're not the first person to want to treat numbers as numbers when they appear mixed with strings. Parsing numbers can quickly become complicated if there is more than one number per string and if you mix integers and decimals and other representations of numbers in strings. Rather than reinventing the wheel, you can use a library that already deals with all the weird edge-cases.

import natsort

ll = [['10001', '1002', '501'],
 ['10001', '1002', '5001'],
 ['1001', '1002', '5'],
 ['1', '1002', '5'],
 ['1', '102', '6'],
 ['1', '12', '4'],
 ['10', '11', '3'],
 ['mihail', '1', '2'],
 ['1', 'mihail', '1']]

ll = natsort.natsorted(ll)

print(*ll, sep='\n')
Stef
  • Oh indeed, I forgot about `natsort` – Swifty Jun 16 '23 at 11:41
  • @Swifty It's possible that it's overkill in this case, since all strings appear to be either full integer or not a number at all, but when the situation is even just a little more complex, it quickly becomes a great option :) Also, we don't know much about the OP's situation. It's possible that using `key` and `natsorted` are both bad ideas in the OP context, and that instead they should first ***modify*** the list so that numbers are not stored as strings. – Stef Jun 16 '23 at 11:56
  • Another question is: is there such a thing in numpy? If there is, it would be superb. Maybe too good to be true :) But anyway, natsort seems to be the wheel that has already been invented. Extremely powerful and concise. – Anton Bibin Jun 16 '23 at 12:11
  • @AntonBibin In numpy no, but in pandas you can use the key functions from natsort: https://stackoverflow.com/questions/29580978/naturally-sorting-pandas-dataframe – Stef Jun 16 '23 at 12:13
  • @AntonBibin I must admit I've never used an array of strings in numpy, though. Numpy is super-duper-cool for arrays of numbers, but for strings I've never even thought about using it. – Stef Jun 16 '23 at 12:14

I started from your approach and used the shortcut from my comment to make the code more compact (and hopefully more efficient, though I didn't test that):

my_list = [
    ['10001', '1002', '501'],
    ['10001', '1002', '5001'],
    ['1001', '1002', '5'],
    ['1', '1002', '5'],
    ['1', '102', '6'],
    ['1', '12', '4'],
    ['10', '11', '3'],
    ['mihail', '1', '2'],
    ['1', 'mihail', '1']
]

def to_numeric(element):
    # Numbers get rank 0 and sort before non-numeric strings (rank 1).
    try:
        return (0, float(element))
    except ValueError:
        return (1, element)

my_list.sort(key=lambda sublist: [to_numeric(element) for element in sublist])

The last statement could be rewritten thus (though I'm not sure it improves anything...):

my_list.sort(key=lambda sublist: tuple(map(to_numeric, sublist)))
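
One thing to be aware of when relying on `float()`: it also parses strings such as `'nan'`, `'inf'` and `'1e2'`, and a NaN inside a sort key makes comparisons unreliable. Below is a hedged sketch of a stricter variant that keeps non-finite values as text; the name `to_finite_number` and the sample rows are hypothetical, not part of the code above.

```python
import math


def to_finite_number(element):
    # Hypothetical stricter variant of to_numeric:
    # strings that parse to NaN or infinity are treated as plain text.
    try:
        value = float(element)
    except ValueError:
        return (1, element)
    if math.isfinite(value):
        return (0, value)
    return (1, element)


rows = [['nan', '1'], ['3', 'x'], ['inf', '2'], ['1e2', '0'], ['20', '5']]
rows.sort(key=lambda sublist: [to_finite_number(e) for e in sublist])
print(*rows, sep='\n')
```

Whether `'1e2'` should count as the number 100 depends on the data; if it should not, a stricter textual check would be needed before calling `float()`.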
Swifty