I have this code:

import glob, os
outdir = './output/'
nstring = 'testdat_2014-12-31'
nfilelist = sorted(glob.glob((outdir+'/*{}*.nc').format(nstring)))

from which I get nfilelist:

['testdat_2014-12-31-21_H1.nc',
 'testdat_2014-12-31-21_H10.nc',
 'testdat_2014-12-31-21_H11.nc',
 'testdat_2014-12-31-21_H12.nc',
 'testdat_2014-12-31-21_H2.nc',
 'testdat_2014-12-31-21_H3.nc',
 'testdat_2014-12-31-21_H4.nc',
 'testdat_2014-12-31-21_H5.nc',
 'testdat_2014-12-31-21_H6.nc',
 'testdat_2014-12-31-21_H7.nc',
 'testdat_2014-12-31-21_H8.nc',
 'testdat_2014-12-31-21_H9.nc']

The H1-H12 numbers at the end reflect how I want to sort the list. But right now, H10-H12 are sandwiched in the middle. How can I sort from H1 to H12?

Regex isn't my strong suit and I'm unable to move forward.

I tried splitting and got this far:

nfilelist[0].split('_')[-1].split('.')
['H1', 'nc']
maximusdooku

4 Answers

Assuming you want to sort them by integer value, you could use a regex as follows:

import re

nfiles  = ['testdat_2014-12-31-21_H1.nc',
 'testdat_2014-12-31-21_H10.nc',
 'testdat_2014-12-31-21_H11.nc',
 'testdat_2014-12-31-21_H12.nc',
 'testdat_2014-12-31-21_H2.nc',
 'testdat_2014-12-31-21_H3.nc',
 'testdat_2014-12-31-21_H4.nc',
 'testdat_2014-12-31-21_H5.nc',
 'testdat_2014-12-31-21_H6.nc',
 'testdat_2014-12-31-21_H7.nc',
 'testdat_2014-12-31-21_H8.nc',
 'testdat_2014-12-31-21_H9.nc']

# Raw string so that \d and \. reach the regex engine unchanged
result = sorted(nfiles, key=lambda x: int(re.search(r'H(\d+)\.nc', x).group(1)))

print(result)

Output

['testdat_2014-12-31-21_H1.nc', 'testdat_2014-12-31-21_H2.nc', 'testdat_2014-12-31-21_H3.nc', 'testdat_2014-12-31-21_H4.nc', 'testdat_2014-12-31-21_H5.nc', 'testdat_2014-12-31-21_H6.nc', 'testdat_2014-12-31-21_H7.nc', 'testdat_2014-12-31-21_H8.nc', 'testdat_2014-12-31-21_H9.nc', 'testdat_2014-12-31-21_H10.nc', 'testdat_2014-12-31-21_H11.nc', 'testdat_2014-12-31-21_H12.nc']

Explanation

The pattern H(\d+)\.nc matches an H followed by one or more digits (\d+) and then .nc. The parentheses capture the digits, and .group(1) returns them. The digits are then converted to an int and used as the key for sorted.
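
To make the explanation concrete, here is a minimal sketch that shows what the match object returns for one name and wraps the lookup in a named key function. The helper name hour_key, the trailing $ anchor, and the choice to raise ValueError for names that do not match are my assumptions, not part of the answer.

```python
import re

# Compile once; the raw string keeps \d and \. from being read as string escapes.
PATTERN = re.compile(r'H(\d+)\.nc$')

def hour_key(name):
    """Return the integer after 'H' in names like '..._H10.nc' (hypothetical helper)."""
    match = PATTERN.search(name)
    if match is None:
        # Assumption: a name without the H<number>.nc suffix is treated as an error.
        raise ValueError('no H<number>.nc suffix in {!r}'.format(name))
    return int(match.group(1))

names = ['testdat_2014-12-31-21_H10.nc',
         'testdat_2014-12-31-21_H2.nc',
         'testdat_2014-12-31-21_H1.nc']

print(PATTERN.search(names[0]).group(0))  # the whole match, e.g. 'H10.nc'
print(PATTERN.search(names[0]).group(1))  # only the captured digits, e.g. '10'
print(sorted(names, key=hour_key))
```

A named function with an explicit check gives a clearer error than the AttributeError the lambda would raise if a stray file did not match the pattern.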

No regex

If you want to avoid regex altogether, use the following function as the key:

def key(element):
    digits = (ix for ix in element.split('_')[-1] if ix.isdigit())
    return int(''.join(digits))

result = sorted(nfiles, key=key)

print(result)

Note

Finally, if you want to sort by the string value instead, simply remove the calls to the int function. The digits are then compared lexicographically, so '10' sorts before '2'.

Dani Mesejo

Instead of the sorted() function, use natsorted() from the natsort module:

import natsort        # pip install natsort

nfilelist = natsort.natsorted(glob.glob((outdir+'/*{}*.nc').format(nstring)))

(The name natsort means natural sort - as opposed to the lexicographical one.)
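
If installing a third-party package is not an option, a similar ordering can be approximated with the standard library alone. This is a minimal sketch, not how natsort is implemented: natural_key is a hypothetical helper name, and it only handles plain runs of digits.

```python
import re

def natural_key(text):
    """Split text into alternating non-digit and digit chunks; digit chunks become ints."""
    # re.split with a capturing group keeps the digit runs in the result,
    # and they always land at the odd indices.
    parts = re.split(r'(\d+)', text)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

names = ['testdat_2014-12-31-21_H1.nc',
         'testdat_2014-12-31-21_H10.nc',
         'testdat_2014-12-31-21_H2.nc']

print(natural_key('a_H10.nc'))
print(sorted(names, key=natural_key))
```

Because strings and integers always sit at matching positions in the key, two keys can be compared without a type error, and every number in the name (including the date parts) is compared numerically.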

MarianD

The names that you sort have a simple and regular structure; you can survive without invoking regex. Sort the names by taking the first part of a name after the "_H", then the first part of it before the ".", and converting the result to an integer:

sorted(nfilelist, 
       key=lambda x: int(x.split("_H")[1].split(".")[0]))
#['testdat_2014-12-31-21_H1.nc', 'testdat_2014-12-31-21_H2.nc', 
# 'testdat_2014-12-31-21_H3.nc', 'testdat_2014-12-31-21_H4.nc', 
# 'testdat_2014-12-31-21_H5.nc', 'testdat_2014-12-31-21_H6.nc', 
# 'testdat_2014-12-31-21_H7.nc', 'testdat_2014-12-31-21_H8.nc', 
# 'testdat_2014-12-31-21_H9.nc', 'testdat_2014-12-31-21_H10.nc', 
# 'testdat_2014-12-31-21_H11.nc', 'testdat_2014-12-31-21_H12.nc']
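
Since glob returns paths that include the directory, a slightly more defensive variant may be useful when the directory name itself could contain "_H" or a dot. The sketch below is one possible variant of the same idea: hour_from_path is a hypothetical helper, the directory name in the sample paths is made up, and every file name is assumed to end in _H<number> followed by an extension.

```python
import os

def hour_from_path(path):
    """Return the integer between the last '_H' and the extension (hypothetical helper)."""
    # Drop the directory and the extension first: '..._H10.nc' -> '..._H10'
    stem, _ext = os.path.splitext(os.path.basename(path))
    # rpartition splits on the LAST '_H', so earlier occurrences are ignored.
    _head, _sep, digits = stem.rpartition('_H')
    return int(digits)

# Made-up directory name that would trip up a plain split("_H")[1].split(".")[0]
paths = ['./out_H.put/testdat_2014-12-31-21_H10.nc',
         './out_H.put/testdat_2014-12-31-21_H2.nc',
         './out_H.put/testdat_2014-12-31-21_H1.nc']

print(sorted(paths, key=hour_from_path))
```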
DYZ

You can achieve this without using a regex:

result = sorted(nfilelist, key=lambda x: (len(x), x))

This key sorts the filenames by length first and then lexicographically, which works here because:

  1. Longer numbers are larger
  2. If numbers are the same length then comparing numbers or strings is the same
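
These two observations hold only when every name shares the same prefix and extension and the numbers carry no leading zeros. The following sketch checks the key on names like those in the question and on a counterexample; length_key is just a named version of the lambda, and the names in mixed are hypothetical.

```python
def length_key(name):
    # Shorter names first; ties are broken by ordinary string comparison.
    return (len(name), name)

same_prefix = ['testdat_2014-12-31-21_H10.nc',
               'testdat_2014-12-31-21_H9.nc',
               'testdat_2014-12-31-21_H1.nc']
# Only the number changes the length, so the order comes out as H1, H9, H10.
print(sorted(same_prefix, key=length_key))

# Hypothetical names whose prefixes differ in length:
mixed = ['long_prefix_H1.nc', 'p_H10.nc']
# 'p_H10.nc' is shorter, so it sorts first even though 10 > 1.
print(sorted(mixed, key=length_key))
```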

Speed comparison with other answers here:

| Method            | Timing                       |
+-------------------+------------------------------+
| Using natsort     | 219 µs  ± 1.13 µs per loop   |
| Daniel's regex    | 14.2 µs ± 434  ns per loop   |
| Daniel's no-regex | 14.2 µs ± 101  ns per loop   |
| DYZ's split based | 7.50 µs ± 240  ns per loop   |
| This answer       | 2.77 µs ± 46.6 ns per loop   |

Timings were obtained using %timeit in IPython (Python 3.7) on a 2.7 GHz Intel Core i7.

lakshayg