
I want to know how I can sort the filenames as they are in the directory. For example, I have the following names:

1_00000_6.54.csv
2_00000_1.70.csv
3_00000_1.70.csv
...
10_00000_1.70.csv
11_00000_1.70.csv
...

The following Python code:

 import os

 def get_pixelist(path):
     return [os.path.join(path, f) for f in os.listdir(path) if f.endswith('.csv')]

 def group_uniqmz_intensities(path):
     pxlist = sorted(get_pixelist(path))

gives:

1_00000_6.54.csv
10_00000_1.70.csv
11_00000_1.70.csv
...
2_00000_1.70.csv
...
3_00000_1.70.csv
...

I want the order shown before.

Martijn Pieters
Hocine Ben
  • Good question. What you're asking is sometimes referred to as the ["natural sort order"](http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html); it would make sense to make a Python `key` for that. – Kos Jan 30 '13 at 10:33

4 Answers


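The easiest would be to zero-pad the filenames when sorting:

def group_uniqmz_intensities(path):
    # Pad the bare filename, not the full path returned by get_pixelist()
    pxlist = sorted(get_pixelist(path), key=lambda f: os.path.basename(f).rjust(17, '0'))

which left-pads each filename to 17 characters with 0 characters when sorting; so 1_00000_6.54.csv is padded to 01_00000_6.54.csv while 10_00000_1.70.csv is left as is. Lexicographically, 01 sorts before 10. The key has to use the bare filename: get_pixelist() returns full paths, which are usually longer than 17 characters already and so would not be padded at all.

I picked 17 as a hardcoded value to simplify things; you could find the required value automatically by using this instead:

def group_uniqmz_intensities(path):
    pxlist = get_pixelist(path)
    # Length of the longest filename (max() raises ValueError if there are no .csv files)
    padsize = max(len(os.path.basename(f)) for f in pxlist)
    pxlist = sorted(pxlist, key=lambda f: os.path.basename(f).rjust(padsize, '0'))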
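To try the padding idea without touching a directory, here is a minimal sketch that sorts a hard-coded sample list. `zero_padded_sort` and the `data/` prefix are hypothetical names used only for illustration. The sketch assumes that the leading number is the only part of the filename whose width varies.

```python
import os

def zero_padded_sort(paths):
    """Sort paths by basename, left-padded with '0' to a common width."""
    if not paths:
        return []
    # Width of the longest basename; shorter names get leading zeros.
    padsize = max(len(os.path.basename(p)) for p in paths)
    return sorted(paths, key=lambda p: os.path.basename(p).rjust(padsize, '0'))

sample = [
    'data/10_00000_1.70.csv',
    'data/1_00000_6.54.csv',
    'data/11_00000_1.70.csv',
    'data/2_00000_1.70.csv',
]
print(zero_padded_sort(sample))
# ['data/1_00000_6.54.csv', 'data/2_00000_1.70.csv',
#  'data/10_00000_1.70.csv', 'data/11_00000_1.70.csv']
```

If another field can change width too (say 1_00000_16.54.csv), the padding no longer lines up the leading numbers, and a key that parses the number is the safer choice.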
Martijn Pieters
  • Nice one! To make sure it works properly 17 should be changed to length of the longest filename. Something like max_length = len(max(get_pixelist(path), key=lambda x: len(x))). – Dimitri Vorona Jan 30 '13 at 10:37
  • I'd use `str.zfill` instead of `str.rjust` – Bakuriu Jan 30 '13 at 10:38
  • thanks Martijn. This is not what I want. I want the following order: 1_00000_6.54.csv 2_00000_1.70.csv 3_00000_1.70.csv ... 10_00000_1.70.csv 11_00000_1.70.csv – Hocine Ben Jan 30 '13 at 10:38
  • @HocineBen That's exactly what you obtain with this solution. – Bakuriu Jan 30 '13 at 10:40
  • @HocineBen see my comment above. Try changing 17 to 25 (or some big number) and see if it helps. – Dimitri Vorona Jan 30 '13 at 10:42
  • I get this (with the solution): 10_00000_6.54.csv 11_00000_1.70.csv 1_00000_1.70 ... 2_00000_1.70.csv ... 3_00000_1.70.csv – Hocine Ben Jan 30 '13 at 10:46
  • @Bakuriu: You could do that too; we don't need to handle signs (`-` or `+` here) so using `.rjust` is probably slightly faster. – Martijn Pieters Jan 30 '13 at 12:08
  • @HocineBen: it could be that I miscounted; use a higher number than 17 or determine the length automatically by using the `max()` line I added. – Martijn Pieters Jan 30 '13 at 12:09
  • @MartijnPieters Testing a bit with `timeit` it seems `zfill` is slightly *faster* than `rjust`. At least in python2.7.3 on linux. – Bakuriu Jan 30 '13 at 13:04

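Strings are compared character by character, so 10_... sorts before 2_... because '1' < '2' (and, since '0' < '_', 10_... even sorts before 1_...). That is why you get the second ordering. You can achieve your goal by giving a key function to sorted:

 def group_uniqmz_intensities(path):
     # Sort by the integer before the first "_" of the bare filename
     pxlist = sorted(get_pixelist(path), key=lambda x: int(os.path.basename(x).split("_")[0]))

Please make sure ALL of your files follow the same naming scheme ({number}_{rest}.csv); otherwise int() will raise a ValueError.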
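If some files might not follow the scheme, one option is a key that sorts them last instead of raising. This is only a sketch: `leading_number_key` is a hypothetical helper, and putting non-matching names at the end is an assumption about what you would want.

```python
import os

def leading_number_key(path):
    """Sort key: the integer before the first '_' in the basename.

    Names that do not start with an integer are placed after all
    numbered names and ordered alphabetically among themselves.
    """
    name = os.path.basename(path)
    prefix = name.partition('_')[0]
    try:
        return (0, int(prefix), name)
    except ValueError:
        # Not a number: group 1 sorts after every numbered file (group 0).
        return (1, 0, name)

sample = ['10_00000_1.70.csv', 'notes.csv', '2_00000_1.70.csv', '1_00000_6.54.csv']
print(sorted(sample, key=leading_number_key))
# ['1_00000_6.54.csv', '2_00000_1.70.csv', '10_00000_1.70.csv', 'notes.csv']
```

The first tuple element keeps integers and strings from ever being compared with each other, which would be a TypeError in Python 3.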

EDIT: Martijn Pieters provides a more elegant solution.

Dimitri Vorona

Based on this answer for alphanumerical sorting:

def group_uniqmz_intensities(path):
    # get_pixelist() returns full paths, so take the number from the bare filename
    pxlist = sorted(get_pixelist(path), key=lambda filename: int(os.path.basename(filename).partition('_')[0]))
BioGeek

Here's a trivial implementation of natural ordering, assuming that your fields are all split by _:

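def int_if_possible(s):
    try:
        return int(s)
    except ValueError:
        return s


>>> # names is the list of filenames; the key is a list so it also works on Python 3
>>> sorted(names, key=lambda name: [int_if_possible(part) for part in name.split('_')])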
['1_00000_6.54.csv',
 '2_00000_1.70.csv',
 '3_00000_1.70.csv',
 '10_00000_1.70.csv',
 '11_00000_1.70.csv']

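This implementation leverages the fact that lists get compared element-by-element. If the elements are convertible to ints, we compare them as ints, otherwise we fall back to string comparison. Note that Python 3 raises a TypeError if the comparison reaches a position where one name has an int and the other a string, so this relies on all filenames sharing the same field layout.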


Edit: A more elaborate solution for natural sorting is presented here: Natural string sorting.

It's pretty clever: it uses a regex such as \d+|\D+ to split input strings into alternating numbers and non-numbers. Then numbers are compared numerically, and non-numbers alphabetically.
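A rough sketch of that idea follows. It is not the code from the linked answer: `natural_key` is a hypothetical name, and the pattern `(\d+)|(\D+)` is my assumption of how the split could be done. Each chunk is tagged so that numbers and text are never compared directly, which keeps it working on Python 3.

```python
import re

# One alternative per chunk type: a run of digits or a run of non-digits.
_CHUNKS = re.compile(r'(\d+)|(\D+)')

def natural_key(text):
    """Split text into digit and non-digit runs for natural ordering.

    Digit runs become (0, int) and other runs become (1, str), so an
    int is only ever compared with an int and a str with a str.
    """
    return [(0, int(num)) if num else (1, other)
            for num, other in _CHUNKS.findall(text)]

names = ['10_00000_1.70.csv', '1_00000_6.54.csv', '2_00000_1.70.csv',
         '11_00000_1.70.csv', '3_00000_1.70.csv']
print(sorted(names, key=natural_key))
# ['1_00000_6.54.csv', '2_00000_1.70.csv', '3_00000_1.70.csv',
#  '10_00000_1.70.csv', '11_00000_1.70.csv']
```

Note that a value like 1.70 is treated as the two integers 1 and 70 separated by a dot, not as a decimal number. For full paths, apply the key to os.path.basename(path).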

Kos