I have a script to move items around and perform some basic functions on them. It relies on list.sort() to make sure the files are going to the right places.

For example I have 11 files:

A1_S1_ETC.ext
A2_S2_ETC.ext
...
...
A10_S10_ETC.ext
A11_S11_ETC.ext

The script asks for an input path and an output path. From these I create sorted lists using os and glob:

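import glob
import os

pathA = raw_input()  # input directory
listA = glob.glob(os.path.join(pathA, '*.ext'))
listA.sort()
outp = raw_input()  # output directory (a string, so it has no .sort())
# Part of each file name before the first underscore, e.g. 'A1'
filen = [x.split(pathA)[1].split('_')[0] for x in listA]
filen.sort()
outp1 = [pathA + s + '/' for s in filen]
outp1.sort()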

But when printed:

print listA
['A10_S10_ETC.ext', 'A11_S11_ETC.ext', 'A1_S1_ETC.ext', 'A2_S2_ETC.ext']
print outp1
['/user/path/A1/', '/user/path/A10/', '/user/path/A11/', '/user/path/A2/']

I guess it's the '_SXX' part in the file name that's impacting the sort function? I don't care how it's sorted, as long as A1 files go into A1 directory - not just for this nomenclature but for any possible string.

Is there a way to do this - perhaps by asking the list.sort function to sort until the first underscore?

J280694
  • If you don't care about the actual order then just remove all the sort calls. All the lists will be ordered the same since each one is based on another using a list comprehension. – interjay Dec 12 '15 at 10:16
  • Unfortunately there's a listB that has to be used with listA and naturally it produces a different order, so there has to be consistent sorting across the board. – J280694 Dec 12 '15 at 11:43

3 Answers

Python sorts strings lexicographically, comparing them character by character. Because '0' and '1' sort before '_', 'A10' and 'A11' come before 'A1_'.

You can get the expected behaviour using:

lst.sort(key=lambda x: int(x.split('_')[0][1:]))
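
As a minimal sketch of that key-based sort: the helper name numeric_prefix and the sample list are only for illustration, and the code assumes every name is a bare file name that starts with a single letter followed by digits and an underscore, as in the question.

```python
def numeric_prefix(name):
    # 'A10_S10_ETC.ext' -> 'A10' -> '10' -> 10
    # Assumes one leading letter, then digits, before the first '_'.
    return int(name.split('_')[0][1:])

# Sample names taken from the question, in their lexicographic order.
files = ['A10_S10_ETC.ext', 'A11_S11_ETC.ext', 'A1_S1_ETC.ext', 'A2_S2_ETC.ext']
files.sort(key=numeric_prefix)
print(files)
```

If the list holds full paths instead of bare names, the part before the first underscore also contains the directory, so the key would need to take the base name first.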
Ayush
  • Is there a way around this? – J280694 Dec 12 '15 at 10:19
  • Is that a python 3 expression? I'm working with 2.7 and after applying that sort my print commands return invalid syntax errors. – J280694 Dec 12 '15 at 10:30
  • @J280694 it worked for me in python 2.7.10. If you can give us the traceback, maybe we can help you get it to work – Ayush Dec 12 '15 at 10:47
  • Maybe I'm misunderstanding its use, I literally changed OP to `outp1 = [pathA + s for s in filen] outp1.sort(key=lambda x: int(x.split('_')[0][1:])` After that point in my script all of the predefined lists no longer match up and nothing works after adding in the `key=lambda x: int(x.split('_')[0][1:]` `File "", line 2 print outp ^` – J280694 Dec 12 '15 at 10:51
  • @J280694 try this: `outp1.sort(key=lambda x: int(x.split('/')[-1][1:]))` – Ayush Dec 12 '15 at 10:58
  • It's still returning in the original order – J280694 Dec 12 '15 at 11:04
  • @J280694 you sure you typed it correctly? the `split` looks missing. If you did, can I see the contents of `outp1` list once more – Ayush Dec 12 '15 at 11:08
  • `outp.sort(key=lambda x: int(x.split('/')[-1][1:])) print outp ValueError: invalid literal for int() with base 10: ''` – J280694 Dec 12 '15 at 11:17
  • what does `outp` contain? – Ayush Dec 12 '15 at 11:20
  • Please check OP for the full list, but for example outp contains '/home/user/Desktop/folder/A1/' – J280694 Dec 12 '15 at 11:27
  • this should work: `outp.sort(key=lambda x: int(x.split('/')[-2][1:]))`. basically there is a ending '/', resulting in `''` as split's last element. – Ayush Dec 12 '15 at 11:29
  • So that's put the list into logical order (i.e. A1, A2, A3, ..., A10, A11) but the issue is that the file names are still at odds order-wise (i.e. listA = A10_S10_ETC.ext, A11_S11_ETC.ext, A1_S1_ETC.ext, ...). I think I actually need to sort listA based on the first _... – J280694 Dec 12 '15 at 11:41

Sorting is lexicographic, and ASCII characters are ordered by their ASCII codes. The code for '0' is 48 while the code for '_' is 95, which means that '0' < '_'.
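
A quick way to check this yourself, using the file names from the question (the variable names are only illustrative):

```python
# Code points decide the order: digits sort before the underscore.
codes = [(ch, ord(ch)) for ch in '01_']
print(codes)

names = ['A1_S1_ETC.ext', 'A2_S2_ETC.ext', 'A10_S10_ETC.ext', 'A11_S11_ETC.ext']
# At index 2, 'A10...' has '0' (48) while 'A1_...' has '_' (95),
# so 'A10...' is placed first.
ordered = sorted(names)
print(ordered)
```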

What you can do to get consistency is to supply a consistent comparison function. For example:

def mycmp(s1, s2):
    s1 = s1.split(pathA)[1].split('_')[0]        
    s2 = s2.split(pathA)[1].split('_')[0]
    return cmp(s1, s2)

outp1.sort(cmp=mycmp)

The point is that the same transformation is applied to both strings before they are compared.

The caveat is that stripping away information can make distinct elements compare as equal. In your case that does not matter, because two such elements would end up as the same entry in outp1 anyway.

Otherwise you would have to apply the sort before you transform the names, which means not sorting filen or outp1 at all (their order would then follow the order of listA).
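
A rough sketch of that second approach, written with a key function rather than cmp so it also runs on Python 3. The helper name prefix, the directory '/user/path' and the sample names are assumptions for illustration only.

```python
import os

def prefix(path):
    # Text before the first underscore of the file name, e.g. 'A10'.
    return os.path.basename(path).split('_')[0]

pathA = '/user/path'  # hypothetical input directory
listA = [os.path.join(pathA, n) for n in
         ['A10_S10_ETC.ext', 'A1_S1_ETC.ext', 'A2_S2_ETC.ext', 'A11_S11_ETC.ext']]

listA.sort(key=prefix)              # sort once, comparing only the prefix
filen = [prefix(p) for p in listA]  # derived lists inherit that order
outp1 = [os.path.join(pathA, s) + '/' for s in filen]
print(filen)
print(outp1)
```

Any second list that has to line up with listA could be sorted with the same key, as long as its names share the same prefixes.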

skyking
  • Ah okay, that makes sense. In my head the only way I can think to combat this would be to include the '_' in directory names and perhaps remove it afterwards - any better ideas? – J280694 Dec 12 '15 at 10:22

What you want is called natural sort. See this thread about it: Does Python have a built in function for string natural sort?
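
For reference, one common way to write a natural sort key is sketched below. It is not necessarily the exact recipe from that thread, and the name natural_key is just an illustrative choice.

```python
import re

def natural_key(s):
    # re.split with a capturing group alternates text and digit runs:
    # 'A10_S10_ETC.ext' -> ['A', '10', '_S', '10', '_ETC.ext']
    # Odd positions are always the digit runs, so convert those to int.
    parts = re.split(r'(\d+)', s)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

names = ['A10_S10_ETC.ext', 'A11_S11_ETC.ext', 'A1_S1_ETC.ext', 'A2_S2_ETC.ext']
names.sort(key=natural_key)
print(names)
```

The same key can be applied to the directory list, so both lists end up in the same A1, A2, ..., A10, A11 order without assuming a particular prefix length.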

zvone