120

I have a CSV-like file, and I would like to sort it by column priority, like "ORDER BY" in SQL. For example, given the following rows,

3;1;2
1;3;2
1;2;3
2;3;1
2;1;3
3;2;1

If "ORDER BY" were column2, column1, column3, the result would be:

2;1;3
3;1;2
1;2;3
3;2;1
1;3;2
2;3;1

I'd like to know how to get this same result using the sort command on Unix.

– Rafael Orágio (edited by Dan Dascalescu)

4 Answers

197

You need to use two options for the sort command:

  • --field-separator (or -t)
  • --key=<start,end> (or -k), to specify the sort key, i.e. which range of columns (start through end index) to sort by. Since you want to sort on 3 columns, you'll need to specify -k 3 times, for columns 2,2, 1,1, and 3,3.

To put it all together,

sort -t ';' -k 2,2 -k 1,1 -k 3,3
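
For example, with the question's rows saved in a file (data.csv is just a placeholder name here),

sort -t ';' -k 2,2 -k 1,1 -k 3,3 data.csv

prints exactly the ordering requested:

2;1;3
3;1;2
1;2;3
3;2;1
1;3;2
2;3;1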

Note that sort can't handle the situation in which fields contain the separator, even if it's escaped or quoted.
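
To see why, note that sort splits on every occurrence of the separator, quoting notwithstanding (the two rows below are made up for illustration):

printf '"a;b";1;9\n"c";5;2\n' | sort -t ';' -k 2,2

Here the second field of the first row, as sort sees it, is b" rather than 1, so the row whose real second column is 5 ends up sorting first.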

Also note: this is an old question, which belongs on UNIX.SE, and was also asked there a year later.


Old answer: depending on your system's version of sort, the following might also work:

sort --field-separator=';' --key=2,1,3

Or you might get a "stray character in field spec" error.

According to the sort manual, if you don't specify the end column of the sort key, it defaults to the end of the line.
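
If you want to check exactly which part of each line a key spec covers, GNU sort (coreutils 8.6 and later) has a --debug option that annotates every output line and warns about questionable key usage; check your man page, since non-GNU sorts may not have it. For instance, piping in two of the sample rows:

printf '3;1;2\n1;3;2\n' | sort --debug -t ';' -k 2

The annotation makes it easy to see that -k 2 with no end field extends to the end of the line.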

– Charlie Martin (edited by Dan Dascalescu)
  • @CharlieMartin: this would not work for CSVs that contain ";" within a cell, right? – user121196 May 12 '13 at 06:50
  • Right. You could work out something treating the left and right hand sides of those as separate fields. If that doesn't work, look at using tr to turn the semicolons into tabs and maybe use column numbers. – Charlie Martin May 13 '13 at 16:18
  • If the values are numeric, then you probably want to consider using the `-n` option, which will "compare according to string numerical value", or the `-g` option, which will "compare according to general numerical value". A string comparison of numeric values will get the numbers ordered like `1,10,2,20`. At least those are the options available in my version of sort on CentOS. You should verify with the man page what the correct options are in your version of sort. – Adam Porad Jun 14 '13 at 15:22
  • I get `sort: stray character in field spec: invalid field specification ‘2,1,3’` – Martin Thoma Aug 19 '14 at 16:13
  • However, `sort --field-separator=',' -r -k3 -k1 -k2 source.csv > target.csv` worked for me. – Martin Thoma Aug 19 '14 at 16:28
  • Moose, can you tell us what system you're using? Various implementations of sort have little differences. – Charlie Martin Aug 21 '14 at 18:42
  • This would never work for real CSV files that contain the field separator within a column – user121196 Sep 14 '14 at 08:18
  • Oddly, that doesn't appear to be the question that was asked, now does it? – Charlie Martin Sep 15 '14 at 22:04
  • @CharlieMartin: Ubuntu 14.10. I use `/usr/bin/sort`. According to `sort --version` it is `sort (GNU coreutils) 8.23` – Martin Thoma Jan 05 '15 at 15:58
  • Well, see? What I had worked on Mac OS X -- essentially BSD -- at the time I wrote the answer. Always read the fine man page. – Charlie Martin Jan 05 '15 at 16:00
  • how long would this take on a 500MB csv file? – Mona Jalal Nov 01 '16 at 23:48
  • Depends on your system. I would imagine tens of minutes, wouldn't be surprised at hours. – Charlie Martin Nov 02 '16 at 01:13
  • I just sorted a 6GB file in minutes - I believe `sort` is highly optimized, also for sorting large files that do not fit in RAM. – Just a student Nov 06 '17 at 13:03
  • @MartinThoma it's been a long time, but I ran into your problem and found that `sort --field-separator=';' --key={2,1,3}` worked. This was with `GNU coreutils 8.4` from April 2016. – mrbolichi May 31 '18 at 09:57
  • @mrbolichi the notation `--key={2,1,3}` uses bash brace expansion – kvantour Nov 03 '19 at 10:52
  • @mrbolichi that might have appeared to work, but is incorrect. I've edited the answer to explain that `-k` expects the start **and end** column indexes. Also note how Martin changed the column order [in their comment](https://stackoverflow.com/questions/9471101/sort-csv-file-by-multiple-columns-using-the-sort-command#comment39595676_9471139) from the original `2,1,3` to `3,1,2`. That is also incorrect. – Dan Dascalescu Apr 24 '22 at 20:51
38

Suppose you have another row 3;10;3 in your unsorted.csv file. Then I guess you expect a numerically sorted result:

2;1;3
3;1;2
1;2;3
3;2;1
1;3;2
2;3;1
3;10;3

and not an alphabetically sorted one:

2;1;3
3;1;2
3;10;3
1;2;3
3;2;1
1;3;2
2;3;1

To get that, you have to use -n:

sort --field-separator=';' -n -k 2,2 -k 1,1 -k 3,3 unsorted.csv

It is worth mentioning that 2,2 has to be used. If only 2 is given, sort takes everything from the beginning of field 2 to the end of the line as the key; 2,2 makes sure that only field 2 is used.
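
Here is a small illustration of that difference with the default string comparison (the two printf rows are made up for the example). With only 2, the key of the first row is the whole tail 2;9, so 2;2;1 sorts first and the second key is never consulted:

printf '1;2;9\n2;2;1\n' | sort -t ';' -k 2 -k 1

With 2,2 and 1,1, the second fields compare equal and the first fields break the tie, so 1;2;9 comes first:

printf '1;2;9\n2;2;1\n' | sort -t ';' -k 2,2 -k 1,1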

– Martin Thoma
  • The pointer as to the difference between `-k 2` and `-k 2,2` is significant! I had overlooked this on my first reading of the man page. Thanks. – usonianhorizon Jul 08 '16 at 18:55
  • I added a few extra rows, `3;10;3`, `3;10;5`, `3;10;2`, `3;10;3`, in that order in the source file, and when using _just_ `-k 2,2` it appears to sort on columns 2 and 3. The man page says `"The -k option may be specified multiple times, in which case subsequent keys are compared when earlier keys compare equal."`. In my case the earlier key (value=10) did compare equal, however, I didn't specify `-k` multiple times. I'm not sure if this is reliable behaviour, or related to my system (Mac). Ultimately it doesn't matter though, as long as the primary sorting is correct. – Davos Oct 28 '19 at 11:21
  • Oh, I see there is also `-s` (stable sort), which disables the last-resort comparison on equal keys; that is apparently faster according to the man page. – Davos Oct 28 '19 at 11:56
26

Charlie's answer above didn't work for me on Cygwin (sort version 2.0, GNU textutils), but the following did:

sort -t"," -k2 -k1 -k1

– Samuel Kerrien
-8

And if anyone followed the 'sort' solution but now wants to get more than a single unique entry per key (i.e. the top X unique entries for each key), then once you've sorted the file using 'sort', you can use a little app I created here:

https://github.com/danieliversen/MiscStuff/blob/master/scripts/findTopUniques.java
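
For reference, a rough shell-only sketch of the same idea, keeping at most N rows per distinct value of the column you sorted on (the file name, the choice of column 2, and N=3 are assumptions for the example):

# keep the first 3 rows for each distinct value of column 2 of the already-sorted output
sort -t ';' -k 2,2 -k 1,1 -k 3,3 data.csv | awk -F ';' 'count[$2]++ < 3'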

– Daniel Iversen
  • Good for you! But in your case, you could have just used `cat unsorted-file | sort | uniq | head -X`, where `X` is the number of first rows you wish to output. – Slavik Meltser May 31 '16 at 14:58
  • @SlavikMe Thanks a lot for the comment! However, your suggestion gives a different result: it gets the first X lines of the totally sorted file, whereas we wanted the first X lines per "key" (i.e. if you have a CSV with names and you sort by column 2, "last name", then your command would perhaps only get 3 lines with "Allen" as the last name, whereas ours would get "Allen", "Brittain", "Charles", etc.). Thanks though! – Daniel Iversen Jun 02 '16 at 15:27
  • You are wrong. I would suggest trying out the command I wrote before commenting. Note that there is a `uniq` command in the pipe, between the `sort` and the `head`, which deduplicates the sorted rows just before the top rows are extracted. – Slavik Meltser Jun 05 '16 at 04:29
  • This solution does not answer the question, as it specifically asks for the `sort` command. – Tiago Cogumbreiro Sep 05 '21 at 18:11