1

I'm trying to sort a csv file that looks like this:

filename,field1,field2
10_somefile,0,0
1_somefile,0,0
2_somefile,0,0
3_somefile,0,0
4_somefile,0,0
5_somefile,0,0
6_somefile,0,0
7_somefile,0,0
8_somefile,0,0
9_somefile,0,0

I've referenced code from another thread:

with open(outfname, "rb") as unsorted_file:
    csv_f = csv.reader(unsorted_file)
    header = next(csv_f, None)
    sorted_data = sorted(csv_f, key=operator.itemgetter(0))

with open(outfname, 'wb') as sorted_file:
    csv_f = csv.writer(sorted_file, quoting=csv.QUOTE_ALL)
    if header:
        csv_f.writerow(header)
    csv_f.writerows(sorted_data)

However, this won't move the '10_somefile' to the end. How can I sort this such that it uses the number before the underscore as the sorting field?

Colin
  • 415
  • 3
  • 14
  • Is the format always "X_somefile...." or the "somefile" part of the data can change too in the same sequence? Like "1_filea...", "1_fileb..". If the latter, the correct output should be "1_filea, 2_filea, 1_fileb,...." or "1_filea, 1_fileb, 2_filea,...". This will show whether you need to sort just on the integers or on the remaining parts of the string. – Lauro Moura Apr 04 '16 at 18:01

3 Answers3

1

This is happening because "10" < "1_". You want to compare integers, not strings. This behavior can be achieved by creating an integer for each line using the characters up to the underscore. Say you can get a string s (which may be done using the itemgetter as you are currently doing). Then, the following lambda (when passed as key for sorted) will do what you want.

key=lambda s: int(s[: (s.index('_'))])))

What this function does is simple: it just returns the integer made up from the characters of s up to, but not including, the first underscore.

Bernardo Sulzbach
  • 1,293
  • 10
  • 26
  • I also tried this and I get the error: '_' is not in list. I suspect it's due to the format my list is in: [['10_somefile','0','0'],['1_somefile','0,'0'],...]. – Colin Apr 05 '16 at 04:11
  • @Colin you're just doing it in the wrong place. `s` in my snippet should be a string, not a list. You can slice strings on Python. Both .index() and [] are meant to be done in a string. As your error message tells you, you are doing them on a list instead. – Bernardo Sulzbach Apr 05 '16 at 06:11
1

The key argument to sorted is returning the first element of each row as a string, making "10..." come before "1_...". You need to use "natural sorting" instead of this raw sorting.

Check How to correctly sort a string with a number inside?

Community
  • 1
  • 1
Lauro Moura
  • 750
  • 5
  • 15
1

Assuming that all your filename fields start off with a number, the simplest thing you can do is to sort by the integer by parsing it out of the filename.

# Assume this is the data of the CSV after reading it in
filenames = ['10_somefile,0,0',
 '1_somefile,0,0',
 '2_somefile,0,0',
 '3_somefile,0,0',
 '4_somefile,0,0',
 '5_somefile,0,0',
 '6_somefile,0,0',
 '7_somefile,0,0',
 '8_somefile,0,0',
 '9_somefile,0,0']

# Here, we treat the first part of the filename (the number before the underscore) as the sort key.
sorted_data = sorted(filenames, key=lambda l: (int(l.partition('_')[0])))

If you output sorted_data, it should look like:

['1_somefile,0,0', '2_somefile,0,0', '3_somefile,0,0', 
 '4_somefile,0,0', '5_somefile,0,0', '6_somefile,0,0', 
 '7_somefile,0,0', '8_somefile,0,0', '9_somefile,0,0', '10_somefile,0,0']
wkl
  • 77,184
  • 16
  • 165
  • 176
  • I cannot seem to get this to work and always get the error 'list object has no attribute 'partition'. Does it require that I read my csv file into a list first? When I do that, it doesn't quite look like what you have above: [['10_somefile','0','0'],['1_somefile','0,'0'],...] – Colin Apr 05 '16 at 04:09