How can I sort a list by numbers instead of string?

Question

I have this code:

import glob, os
outdir = './output/'
nstring = 'testdat_2014-12-31'
nfilelist = sorted(glob.glob((outdir+'/*{}*.nc').format(nstring)))

from which I get nfilelist:

['testdat_2014-12-31-21_H1.nc',
 'testdat_2014-12-31-21_H10.nc',
 'testdat_2014-12-31-21_H11.nc',
 'testdat_2014-12-31-21_H12.nc',
 'testdat_2014-12-31-21_H2.nc',
 'testdat_2014-12-31-21_H3.nc',
 'testdat_2014-12-31-21_H4.nc',
 'testdat_2014-12-31-21_H5.nc',
 'testdat_2014-12-31-21_H6.nc',
 'testdat_2014-12-31-21_H7.nc',
 'testdat_2014-12-31-21_H8.nc',
 'testdat_2014-12-31-21_H9.nc']

The H1-H12 numbers at the end reflect how I want to sort the list. But right now, H10-H12 are sandwiched in the middle. How can I sort from H1 to H12?

Regex isn't my strong suit and I'm unable to move forward.

I tried splitting and got this far:

nfilelist[0].split('_')[-1].split('.')
['H1', 'nc']

See https://stackoverflow.com/questions/5967500/how-to-correctly-sort-a-string-with-a-number-inside — Vatsal, Oct 27 '18 at 00:17
@maximusdooku do you want to sort by int value or string value? — Dani Mesejo, Oct 27 '18 at 00:24

Dani Mesejo · Accepted Answer · 2018-10-27T00:29:57.937

Assuming you want to sort them by int value, you could use a regex in the following way:

import re

nfiles  = ['testdat_2014-12-31-21_H1.nc',
 'testdat_2014-12-31-21_H10.nc',
 'testdat_2014-12-31-21_H11.nc',
 'testdat_2014-12-31-21_H12.nc',
 'testdat_2014-12-31-21_H2.nc',
 'testdat_2014-12-31-21_H3.nc',
 'testdat_2014-12-31-21_H4.nc',
 'testdat_2014-12-31-21_H5.nc',
 'testdat_2014-12-31-21_H6.nc',
 'testdat_2014-12-31-21_H7.nc',
 'testdat_2014-12-31-21_H8.nc',
 'testdat_2014-12-31-21_H9.nc']

# Raw string, so the backslashes in \d and \. are passed to the regex engine unchanged
result = sorted(nfiles, key=lambda x: int(re.search(r'H(\d+)\.nc', x).group(1)))

print(result)

Output

['testdat_2014-12-31-21_H1.nc', 'testdat_2014-12-31-21_H2.nc', 'testdat_2014-12-31-21_H3.nc', 'testdat_2014-12-31-21_H4.nc', 'testdat_2014-12-31-21_H5.nc', 'testdat_2014-12-31-21_H6.nc', 'testdat_2014-12-31-21_H7.nc', 'testdat_2014-12-31-21_H8.nc', 'testdat_2014-12-31-21_H9.nc', 'testdat_2014-12-31-21_H10.nc', 'testdat_2014-12-31-21_H11.nc', 'testdat_2014-12-31-21_H12.nc']

Explanation

The pattern H(\d+)\.nc matches a group of digits (\d+) preceded by an H and followed by .nc, and .group(1) returns that group of digits. The digits are then converted to an int and used as the key for sorted.
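
To see what the key computes for a single name, the pattern can be tried on its own. This is only a sketch: the helper name hour_index and the choice to raise ValueError for a non-matching name are assumptions made for illustration, not part of the answer above.

```python
import re

# Same pattern as in the answer, compiled once and written as a raw string
PATTERN = re.compile(r'H(\d+)\.nc')

def hour_index(filename):
    """Return the integer after 'H' in names like '..._H10.nc' (hypothetical helper)."""
    match = PATTERN.search(filename)
    if match is None:
        # re.search returns None when nothing matches, so fail with a clear message
        raise ValueError(f'no H<number>.nc part in {filename!r}')
    return int(match.group(1))

names = ['testdat_2014-12-31-21_H10.nc',
         'testdat_2014-12-31-21_H2.nc',
         'testdat_2014-12-31-21_H1.nc']
print(sorted(names, key=hour_index))
```

Checking for None explicitly avoids the AttributeError that the one-line lambda would raise if a file without the H<number>.nc part ended up in the list.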

No regex

If you want to avoid regex altogether, use the following function as the key:

def key(element):
    digits = (ix for ix in element.split('_')[-1] if ix.isdigit())
    return int(''.join(digits))

result = sorted(nfiles, key=key)

print(result)

Note

Finally, if you want to sort by the string value, simply remove the calls to the int function.
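
The difference between the two orderings shows up on a small example. The list suffixes below is made up for illustration:

```python
suffixes = ['1', '10', '2']

# Compared as strings, '10' sorts before '2' because '1' < '2' character by character
print(sorted(suffixes))           # ['1', '10', '2']

# Compared as integers, the numeric order is used
print(sorted(suffixes, key=int))  # ['1', '2', '10']
```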

MarianD · Answer 2 · score 0 · 2018-11-16T15:58:43.787

Instead of the sorted() function, use natsorted() from the natsort module:

import glob
import natsort        # third-party: pip install natsort

# outdir and nstring are the variables defined in the question
nfilelist = natsort.natsorted(glob.glob((outdir+'/*{}*.nc').format(nstring)))

(The name natsort means natural sort - as opposed to the lexicographical one.)
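
If installing a third-party package is not an option, the idea of a natural sort can be approximated with the standard library. The sketch below is not how natsort is implemented; natural_key is a hypothetical helper that splits each name into digit and non-digit runs and compares the digit runs as integers.

```python
import re

def natural_key(text):
    """Split text into alternating non-digit and digit runs (hypothetical helper)."""
    # With a capturing group, re.split keeps the digit runs at the odd indices
    parts = re.split(r'(\d+)', text)
    return [int(part) if index % 2 else part
            for index, part in enumerate(parts)]

names = ['testdat_2014-12-31-21_H1.nc',
         'testdat_2014-12-31-21_H10.nc',
         'testdat_2014-12-31-21_H2.nc']
print(sorted(names, key=natural_key))
```

Because even positions always hold strings and odd positions always hold integers, two keys never compare a string with an integer.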

edited Nov 16 '18 at 15:58

answered Oct 27 '18 at 01:57


score 0 · Answer 3 · answered Oct 27 '18 at 04:09

The names you are sorting have a simple, regular structure, so you can do without regex. Take the part of each name after the "_H", then the part of that before the ".", and convert the result to an integer:

sorted(nfilelist, 
       key=lambda x: int(x.split("_H")[1].split(".")[0]))
#['testdat_2014-12-31-21_H1.nc', 'testdat_2014-12-31-21_H2.nc', 
# 'testdat_2014-12-31-21_H3.nc', 'testdat_2014-12-31-21_H4.nc', 
# 'testdat_2014-12-31-21_H5.nc', 'testdat_2014-12-31-21_H6.nc', 
# 'testdat_2014-12-31-21_H7.nc', 'testdat_2014-12-31-21_H8.nc', 
# 'testdat_2014-12-31-21_H9.nc', 'testdat_2014-12-31-21_H10.nc', 
# 'testdat_2014-12-31-21_H11.nc', 'testdat_2014-12-31-21_H12.nc']
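
The key above takes element [1] of the split, so it relies on "_H" appearing only once in each name. As a hedged variation that is not part of this answer, str.rsplit with a limit of one split uses only the last "_H". The helper name h_number and the second filename below are hypothetical, invented to show the difference.

```python
def h_number(name):
    # Split once from the right so only the final "_H" is used
    tail = name.rsplit('_H', 1)[1]
    # Keep the text before the first "." and convert it to an integer
    return int(tail.split('.')[0])

names = ['testdat_2014-12-31-21_H10.nc',
         'run_H1_2014-12-31-21_H2.nc',   # hypothetical name containing "_H" twice
         'testdat_2014-12-31-21_H1.nc']
print(sorted(names, key=h_number))
```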

This solution will fail if you have a file named testdat_2018-12-31-21_H0.nc ;) — lakshayg, Oct 27 '18 at 04:29
@LakshayGarg According to the OP, this is not possible: `nstring = 'testdat_2014-12-31'`. — DYZ, Oct 27 '18 at 04:56

score 0 · Answer 4 · answered Oct 27 '18 at 05:12

You can achieve this without using a regex:

result = sorted(nfilelist, key=lambda x: (len(x), x))

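This key compares the filenames by length first and then as strings. It works because a longer number is a larger number, and because numbers of the same length sort the same way whether they are compared as numbers or as strings. It assumes the names differ only in the trailing number and that the numbers have no leading zeros.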
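
A short sketch of where this key holds and where it does not. The list mixed contains hypothetical names, invented here to show the limit of the approach:

```python
names = ['testdat_2014-12-31-21_H10.nc',
         'testdat_2014-12-31-21_H2.nc',
         'testdat_2014-12-31-21_H1.nc']

# Same prefix and suffix, so any difference in length comes from the number
print(sorted(names, key=lambda x: (len(x), x)))

# Hypothetical names with different prefixes: length no longer tracks the number,
# so 'p_H10.nc' (8 characters) sorts before 'longprefix_H1.nc' (16 characters)
mixed = ['longprefix_H1.nc', 'p_H10.nc']
print(sorted(mixed, key=lambda x: (len(x), x)))
```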

Speed comparison with other answers here:

| Method            | Timing                       |
|-------------------|------------------------------|
| Using natsort     | 219 µs  ± 1.13 µs per loop   |
| Daniel's regex    | 14.2 µs ± 434  ns per loop   |
| Daniel's no-regex | 14.2 µs ± 101  ns per loop   |
| DYZ's split based | 7.50 µs ± 240  ns per loop   |
| This answer       | 2.77 µs ± 46.6 ns per loop   |

Timings were obtained using %timeit in iPython3.7 running on 2.7 GHz Intel Core i7
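
For anyone who wants to repeat the comparison outside IPython, a rough sketch with the standard timeit module follows. The dictionary methods, its labels, and the repeat counts are choices made for this example; natsort is left out because it is a third-party package, and the numbers will differ from the table depending on machine and Python version.

```python
import re
import timeit

# The twelve filenames from the question, in the order glob returned them
names = [f'testdat_2014-12-31-21_H{i}.nc'
         for i in (1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9)]

def no_regex_key(element):
    # Collect the digits of the last "_"-separated part and convert them to an int
    digits = (ch for ch in element.split('_')[-1] if ch.isdigit())
    return int(''.join(digits))

methods = {
    'regex': lambda: sorted(
        names, key=lambda x: int(re.search(r'H(\d+)\.nc', x).group(1))),
    'no-regex': lambda: sorted(names, key=no_regex_key),
    'split': lambda: sorted(
        names, key=lambda x: int(x.split('_H')[1].split('.')[0])),
    'length': lambda: sorted(names, key=lambda x: (len(x), x)),
}

calls = 2000
for label, func in methods.items():
    # Best of 3 repeats, reported per call in microseconds
    best = min(timeit.repeat(func, number=calls, repeat=3)) / calls
    print(f'{label:10s} {best * 1e6:8.2f} µs per call')
```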
