Sort numeric lines with thousand separators

Question

I want to sort a list of strings that start with numbers (using Python 3).
Many of these numbers use thousands separators (dots) and decimal separators (commas).

extract of mylist:

mylist = ['23 text', '23.130', '12 text', '1.482 text', '3,25']

I tried this:
Numeric sorting:

sorted(mylist, key=int, reverse=True) --> gives a 'not an integer' error

I tried this also:
Alphanumeric sorting:

convert = lambda text: int(text) if text.isdigit() else text
alphanum_key = lambda key: [ convert(c) for c in regex.split('([0-9]+)', key) ]
mylist.sort( key=alphanum_key, reverse=True )

Alphanumeric output:

['23.130', '23 text', '12 text', '3,25', '1.482 text']

expected output:

['3,25', '12 text', '23 text', '1.482 text', '23.130']

How can I sort my list with the expected output?

EDIT

If there are strings containing only text, e.g.

mylist = ['2 another test', '4,32', '801', '4apples', 'foo', '4,32 hi', 'apples4', '', '13.300 a test', '2apples', 'doo', '12 today']

I would like the output as below (including the empty fields):

['2 another test', '2apples', '4apples', '4,32', '4,32 hi', '12 today', '801', '13.300 a test', 'apples4', 'doo', 'foo', '']

Take a look at the natsort library. Not sure if it is what you need for this particular situation, but there is a good chance it will. — Mad Physicist, Apr 05 '16 at 16:20
@MadPhysicist: natsort gives this as output: `['1.482 text', '3,25', '12 text', '23 text', '23.130']` — Reman, Apr 05 '16 at 16:27
Did you set the localization parameters correctly? Because it uses American decimal and thousands separator, which are the opposite of the notation you are using. — Mad Physicist, Apr 05 '16 at 16:28
http://pythonhosted.org/natsort/examples.html#locale-aware-sorting-human-sorting — Mad Physicist, Apr 05 '16 at 16:31
Double and triple checked your use-case. Going to submit an issue to `natsort`. It should be able to handle this after doing something like `import locale; locale.setlocale(locale.LC_ALL, 'german')`, but does not work correctly as you said. — Mad Physicist, Apr 05 '16 at 16:45
@MadPhysicist: I first installed naturalsort but it was not the right one. then I installed natsort. Now with natsort installed it doesn't recognize the module `from natsort import natsorted ImportError: cannot import name 'natsorted'` — Reman, Apr 05 '16 at 16:45
@Reman. Interesting how did you install it and what version are you using? — Mad Physicist, Apr 05 '16 at 16:50
@Reman, yep brain fart, you can actually sort using `locale` once you have the appropriate locale installed; atof will handle the conversion — Padraic Cunningham, Apr 05 '16 at 16:58
Just as an FYI: https://github.com/SethMMorton/natsort/issues/36. Hopefully that gets fixed. Then the answer will be a single line. This is good for the library. I have used it a number of times with the default locale, never tried customizing before. — Mad Physicist, Apr 05 '16 at 17:28

alecxe · Answer 1 · 2016-04-05T16:25:38.897

You can solve it with a custom sorting function:

>>> sorted(mylist, key=lambda item: float(item.split(" ", 1)[0].replace(".", "").replace(",", ".")))
['3,25', '12 text', '23 text', '1.482 text', '23.130']

where the key function in this case splits each item by a space, gets the first item, replaces a dot with an empty string and a comma with a dot, then converts the result into float.

This solution makes assumptions and works for the provided sample data. You may need to tweak or improve it for your real data; for example, as written it fails on any item whose first token cannot be converted to float.
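As a minimal sketch of one way to handle that failure case, the key can be moved into a named function with a fallback. The helper name `leading_number` and the choice to push unparseable items to the end with `float("inf")` are assumptions made for illustration, not part of the answer above.

```python
def leading_number(item):
    """Hypothetical helper: parse the first whitespace-separated token,
    assuming '.' is the thousands separator and ',' the decimal separator."""
    parts = item.split(None, 1)
    if not parts:
        # Empty or whitespace-only string: sort it after every number.
        return float("inf")
    token = parts[0].replace(".", "").replace(",", ".")
    try:
        return float(token)
    except ValueError:
        # No number at the start: also sort it after every number.
        return float("inf")


mylist = ['23 text', '23.130', '12 text', '1.482 text', '3,25', 'foo', '']
print(sorted(mylist, key=leading_number))
# ['3,25', '12 text', '23 text', '1.482 text', '23.130', 'foo', '']
```

Items that fall back to `float("inf")` keep their original relative order because the sort is stable. Note that `float` also accepts tokens such as `inf` or `nan`, so real data containing those words would need an extra check.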

Alexce, thank you very much for your answer. I've learned new things thanks to you too. — Reman, Apr 05 '16 at 18:27

Padraic Cunningham · Accepted Answer · 2016-04-06T09:48:10.970

You can actually use locale, just use locale.atof to cast after setting the locale to a suitable region:

In [6]: from locale import atof   
In [7]: import locale

In [8]: locale.setlocale(locale.LC_ALL, 'de_DE')
Out[8]: 'de_DE'

In [9]: mylist = ['23 text', '23.130', '12 text', '1.482 text', '3,250']

In [10]: sorted(mylist,key=lambda x: atof(x.split()[0]))
Out[10]: ['3,250', '12 text', '23 text', '1.482 text', '23.130']
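
Locale names are platform-specific: this thread uses both 'de_DE' and 'german', and `setlocale` raises `locale.Error` when the requested name is not installed. The following is a rough sketch of trying several names in turn. The helper name `set_first_available_locale` and the candidate list are assumptions, and which names work depends on the system.

```python
import locale


def set_first_available_locale(candidates, category=locale.LC_ALL):
    """Hypothetical helper: try each locale name in order and return the
    setting that was accepted, or None if none of them is installed."""
    for name in candidates:
        try:
            return locale.setlocale(category, name)
        except locale.Error:
            continue
    return None


# Candidate names are guesses covering common Linux/macOS and Windows spellings.
chosen = set_first_available_locale(
    ["de_DE.UTF-8", "de_DE", "German_Germany", "german"]
)
if chosen is None:
    print("No German locale found; locale.atof will use the current settings.")
else:
    conv = locale.localeconv()
    # Show which separators locale.atof will actually honour.
    print(chosen, repr(conv["decimal_point"]), repr(conv["thousands_sep"]))
```

Printing the separators from `locale.localeconv()` is a quick way to confirm that the chosen locale really uses ',' for decimals and '.' for thousands before relying on `atof`.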

If some items can be text only, you can use a try/except. What you expect to happen to those strings in the sort decides what goes in the except; for now I just return float("inf") so the strings are pushed to the end:

from locale import atof
import locale

locale.setlocale(locale.LC_ALL, 'de_DE')

mylist = ['23 text', '23.130', '12 text', '1.482 text', '3,250', "foo"]


def atof_try(x):
    try:
        return atof(x.split()[0])
    except ValueError:
        return float("inf")

So if we add foo to mylist:

In [35]: mylist = ['23 text', '23.130', '12 text', '1.482 text', '3,250', "foo"]

In [36]: sorted(mylist, key=atof_try)
Out[36]: ['3,250', '12 text', '23 text', '1.482 text', '23.130', 'foo']

Apart from the position of the empty string, this matches your expected output. The regular sort puts the empty string first among the leftover strings; we can change that if it really matters:

from locale import atof
import locale

locale.setlocale(locale.LC_ALL, 'de_DE')
import re

wrong_type = object()


def atof_try(x):
    try:
        return atof(x.split()[0])
    except ValueError:
        return wrong_type


def atof_pre(x, patt=re.compile(r"^\d+")):
    try:
        _atof = atof_try(x)
        if _atof is not wrong_type:
            return _atof
        temp = patt.search(x)
        return int(temp.group())
    except (ValueError, IndexError, AttributeError):
        return wrong_type


def merge_types(l, out):
    for ele in l:
        if atof_pre(ele) is not wrong_type:
            yield ele
        else:
            out.append(ele)

The output:

In [3]: temp = []

In [4]: mylist[:] = sorted(merge_types(mylist, temp), key=atof_pre) + sorted(temp)

In [5]: print(mylist)
['2 another test', '2apples', '4apples', '4,32', '4,32 hi', '12 today', '801', '13.300 a test', '', 'apples4', 'doo', 'foo']

The version below puts the logic in a class, sorts the list of odd strings in place, and extends rather than concatenates. You can pass in lambdas to specify what to sort on, and rev determines whether the sort is reversed:

from locale import atof
import re


class WeirdSort:
    def __init__(self, in_list, rev=False, which=None, other=None):
        # holds all strings that don't match the pattern we want.
        self.temp = []
        self.in_list = in_list
        self.wrong_type = object()
        # what lambda to pass as the sort key.
        self.which = which
        # split data and sort in_list.
        self.in_list[:] = sorted(self.separate_types(), key=self.atof_pre, reverse=rev)
        # sort odd strings.
        self.temp.sort(key=other, reverse=rev)
        # merge both lists.
        if rev:
            self.temp.extend(self.in_list)
            self.in_list[:] = self.temp
        else:
            self.in_list.extend(self.temp)
        del self.temp

    def atof_try(self, x):
        """Try to cast using specified locale,
           return wrong_type on failure."""
        try:
            return atof(self.which(x))
        except ValueError:
            return self.wrong_type

    def atof_pre(self, x, patt=re.compile(r"^\d+")):
        """Try to cast using atof initially,
           on failure,  try to pull digits from
           front of string and cast to int.
           On failure, returns wrong_type object
           which will mean "x" will be sorted using a regular sort.
        """
        try:
            _atof = self.atof_try(x)
            if _atof is not self.wrong_type:
                return _atof
            temp = patt.search(x)
            return int(temp.group())
        except (ValueError, IndexError, AttributeError):
            return self.wrong_type

    def separate_types(self):
        """Separate elements that can be cast to a float
           using atof/int/re logic and those that cannot,
           anything that cannot be sorted will be
           added to self.temp and sorted separately.
        """
        for ele in self.in_list:
            if self.atof_pre(ele) is not self.wrong_type:
                yield ele
            else:
                self.temp.append(ele)

The empty string is also now at the end.

So for the input:

import locale
locale.setlocale(locale.LC_ALL, 'de_DE')

mylist = ['2 another test', '4,32', '801', '4apples', 'foo', '4,32 hi', 'apples4', '', '13.300 a test', '2apples', 'doo', '12 today']
# Keys for a flat list of strings.
flat_lambda1 = lambda x: x.split()[0]
flat_lambda2 = lambda x: (x == "", x)
WeirdSort(mylist, True, flat_lambda1, flat_lambda2)
print(mylist)
WeirdSort(mylist, False, flat_lambda1, flat_lambda2)
print(mylist)

# Keys for a list of sublists, sorting on the first element of each.
sublist_lambda1 = lambda x: x[0].split()[0]
sublist_lambda2 = lambda x: (x[0] == "", x[0])
mylist = [['3,25', 1], ['12 text', 2], ["", 5], ['23 text', 3]]
WeirdSort(mylist, True, sublist_lambda1, sublist_lambda2)
print(mylist)
WeirdSort(mylist, False, sublist_lambda1, sublist_lambda2)
print(mylist)

You get:

['', 'foo', 'doo', 'apples4', '13.300 a test', '801', '12 today', '4,32', '4,32 hi', '4apples', '2 another test', '2apples']
['2 another test', '2apples', '4apples', '4,32', '4,32 hi', '12 today', '801', '13.300 a test', 'apples4', 'doo', 'foo', '']
[['', 5], ['23 text', 3], ['12 text', 2], ['3,25', 1]]
[['3,25', 1], ['12 text', 2], ['23 text', 3], ['', 5]]
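
For comparison, here is a locale-free sketch that gets the same ascending order with a single tuple key instead of a class. This is a different technique from the code above, not a restatement of it. The regular expression and the name `mixed_key` are assumptions: the pattern only recognizes numbers written with '.' as thousands separator and ',' as decimal separator.

```python
import re

# Assumed number format: digits, optional '.'-separated groups of three
# digits, optional ',' followed by decimals.
_LEADING_NUMBER = re.compile(r"\d+(?:\.\d{3})*(?:,\d+)?")


def mixed_key(item):
    """Hypothetical key: numbers first (by value), then plain text
    (alphabetically), then empty strings."""
    match = _LEADING_NUMBER.match(item)
    if match:
        number = float(match.group().replace(".", "").replace(",", "."))
        # Empty third field so equal numbers keep their input order.
        return (0, number, "")
    if item:
        return (1, 0.0, item)
    return (2, 0.0, "")


mylist = ['2 another test', '4,32', '801', '4apples', 'foo', '4,32 hi',
          'apples4', '', '13.300 a test', '2apples', 'doo', '12 today']
result = sorted(mylist, key=mixed_key)
print(result)
# ['2 another test', '2apples', '4apples', '4,32', '4,32 hi', '12 today',
#  '801', '13.300 a test', 'apples4', 'doo', 'foo', '']
```

The first tuple field picks the group, so no separate pass is needed to split numeric and non-numeric items. The trade-off is that the separators are hard-coded rather than taken from the locale.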

Thanks. I like the solution of alecxe as well. Using locale seems to me a bit easier. Just one more question. I tried your solution adding just a text string to my list. This gives an error. — Reman, Apr 05 '16 at 17:08
If you can have mixed types you have two options. If the text is always alpha we can handle that with an if, if that is not guaranteed then we need to use a try/except, add what you expect in such cases and I will add how to get around it — Padraic Cunningham, Apr 05 '16 at 17:09
but this gives not the output as expected. It must be `['3,250', '12 text', '23 text', '1.482 text', '23.130', 'foo']` First sorting of numbers (please see you first example) — Reman, Apr 05 '16 at 17:40
@Reman, yes sorry, I forgot to add `.split()[0]`, it will do the job now, what should happen if `"bar"` was in there too? — Padraic Cunningham, Apr 05 '16 at 17:47
@PadraicCunningham nice use of `locale` and good error-handling mechanism! — alecxe, Apr 05 '16 at 17:50
Padraic, thank you very much! Now it works. I waited to answer you because I noted that it still gave an error. I captured lines in a file to a list and one line was empty. There was an empty field in the list and that gave an error. Will it be easy to adapt it to empty fields as well? — Reman, Apr 05 '16 at 18:37
@Reman, no prob, if you want to ignore empty lines you can filter with `if line.strip()` when you parse the lines of the file. — Padraic Cunningham, Apr 05 '16 at 18:38
Does it not sort fields like '4peers', '2peers' or 'apples'? — Reman, Apr 05 '16 at 18:41
@Reman, so really you want all digits whether there is a space or not, i,e for the above 2,4,4 should be extracted for the sort? — Padraic Cunningham, Apr 05 '16 at 18:42
@PadraicCunningham, sorry not being clear. I just want to sort all lines as above in my example using numbers (at the start of the string). When there are no numbers any more a sort like the default sort (apples, apples2, apples4, peers) etc and leave empty lines at the end of the text. (As the alphanumeric sort indicated in my question but recognizing the thousand separators and decimals) — Reman, Apr 05 '16 at 18:45
@Reman,can you stick those elements in a list and add the expected output to your question? — Padraic Cunningham, Apr 05 '16 at 18:48
@Reman, no worries, what if there was `42apple`? do we consider all leading numbers in the sort and ignore trailing? — Padraic Cunningham, Apr 05 '16 at 19:08
Tnx. :) '42apple' after '12 today' First all numbers then sorting of non numbers. Hope you don't have to adapt too much. — Reman, Apr 05 '16 at 19:11
@Reman, I tried to incorporate all the conditions, try the code to see if we are nearly there and then we can worry about improvements and explanation! — Padraic Cunningham, Apr 05 '16 at 19:36
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/108326/discussion-between-reman-and-padraic-cunningham). — Reman, Apr 05 '16 at 19:55
@PadraicCunningham, I tried to sort a file with a letter and number at the start of every line without success: `D8 D40 D32 D4 D73 D42 D78 D41 D99 D43 D9 D87 E5 G1` --> output: `D32 D4 D40 D41 D42 D43 D73 D78 D8 D87 D9 D99 E5 G1` — Reman, Aug 19 '16 at 21:09

score 2 · Answer 3 · edited May 09 '16 at 04:24

Old question, but the natsort library version 5.0.0 now respects the locale's thousands separator when calling humansorted:

import natsort, locale
locale.setlocale(locale.LC_ALL, 'german')
mylist = ['23 text', '23.130', '12 text', '1.482 text', '3,25']
natsort.humansorted(mylist)

answered May 09 '16 at 03:33 by Mad Physicist · edited May 09 '16 at 04:24 by SethMMorton

Hi, I just checked it with my list in question `mylist = ['', 'foo', 'doo', 'apples4', '13.300 a test', '801', '12 today', '4,32', '4,32 hi', '4apples', '2 another test', '2apples']` --> results: `['', '2 another test', '2apples', '4,32', '4,32 hi', '4apples', '12 today', '801', '13.300 a test', 'apples4', 'doo', 'foo']` ==> this is wrong: '4apples' must be before '4,32', '4,32 hi' – Reman Nov 21 '16 at 17:58
I tried locale 'german' and locale 'french', both the same results. – Reman Nov 21 '16 at 22:38
I would like to ask you one more question: `r=['4apples', '801', 'Foo', 'foo', 'Foo', 'bar', 'Bar']` --> humansorted(r) = `['4apples', '801', 'bar', 'Bar', 'foo', 'Foo', 'Foo']` --> humansorted(r, alg=ns.IGNORECASE) = `['4apples', '801', 'bar', 'Bar', 'Foo', 'foo', 'Foo']` How can I obtain this result? `['4apples', '801', 'Bar', 'bar', 'Foo', 'Foo', 'foo']` – Reman Nov 22 '16 at 12:12
@Reman I am pretty sure that the sort is stable (since this algorithm just generates keys for `sort`, which is stable based on http://stackoverflow.com/a/1915418/2988730), so you can try sorting twice: `humansorted(reverse(humansorted(r)), alg=ns.IGNORECASE)`. The first sort will group by case, then swap the order of the upper and lower case elements with `reverse`. The second sort will preserve the identical elements that only differ by case. – Mad Physicist Nov 22 '16 at 16:43
Mad Physicist, what about the sequence (my 1st comment)? – Reman Nov 23 '16 at 21:52
Mad, `so you can try sorting twice: humansorted(reverse(humansorted(r)), alg=ns.IGNORECASE)` --> this gives an error on my system (name 'reverse' is not defined) – Reman Nov 23 '16 at 21:57
Sorry, `reversed` is the built-in function. `reverse` is a method of `list`. – Mad Physicist Nov 24 '16 at 04:03
Thanks 'mad physicist'. Now I'm curious also to know how to resolve the sorting question (my 1st comment); `'4apples'` must be before `'4,32', '4,32 hi'`. Do you have any idea? (please see my question at the top of this page) – Reman Nov 24 '16 at 07:31
@Reman, It seems like you want to sort by floating point numbers. You can do the following: `humansorted(r, alg=ns.REAL)` – SethMMorton Jan 20 '17 at 05:28
@SethMMorton, you're right that is the solution but with `alg=ns.real` ignorecase (`alg=ns.ignorecase`) is not active anymore. Is it possible to use both (real & ignorecase) as key? I checked the natsort page but there was no example with multiple keys. – Reman Jan 20 '17 at 10:49
@Reman, aren't those just bit flags that you can combine with `|`? – Mad Physicist Jan 20 '17 at 14:08
@Reman Here is the documentation for how to combine multiple keys: http://pythonhosted.org/natsort/ns_class.html. I will make sure I add an explicit example on the README. – SethMMorton Jan 20 '17 at 16:08
The previous link will go dead in the near future. Here is the new home: natsort.readthedocs.io/en/master/ns_class.html – SethMMorton Aug 20 '17 at 02:45
