I need to sort some data with Unix sort, but I can't figure out the right syntax. The data looks like this:

3.9.1 Step 10:
3.9.1 Step 20:
3.8.10 Step 20:
3.10.2 Step 10:
3.8.4 Step 90:
3.8.4 Step 100:
3.8.4 Step 10:

I want to sort it first by the major number (the dotted version at the start of each line), then by the step number. For example, the data above would sort to:

3.8.4 Step 10:
3.8.4 Step 90:
3.8.4 Step 100:
3.8.10 Step 20:
3.9.1 Step 10:
3.9.1 Step 20:
3.10.2 Step 10:

I have found the way to sort by first number on this site:

sort -t. -k 1,1n -k 2,2n -k 3,3n

but I am struggling to then sort by the Step number in the third column without disturbing the first sort.

Steve
jdex

4 Answers

How about transforming the Step and : on the way into sort, and then transforming back afterwards? I believe this gets the results you're looking for:

cat your-file.txt \
    | sed -e 's/ Step \(.*\):$/.\1/g' \
    | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n \
    | sed -e 's/\(.*\)\.\(.*\)$/\1 Step \2:/g'

(Just using cat here for expository purposes. If it's just a regular file, then it could be passed to the first sed.)
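
As a rough sketch, the three stages of the pipeline above can be wrapped in shell functions so that each transformation is visible on its own. The function names (`to_keys`, `from_keys`, `sort_steps`) are made up for illustration. The sketch assumes every line ends in the `Step N:` form shown in the question; a line with text after the colon would not be handled correctly by these patterns.

```shell
# Hypothetical helper names; the sed and sort commands are the ones from the pipeline above.
to_keys()    { sed -e 's/ Step \(.*\):$/.\1/g'; }            # "3.8.4 Step 10:" -> "3.8.4.10"
from_keys()  { sed -e 's/\(.*\)\.\(.*\)$/\1 Step \2:/g'; }   # "3.8.4.10" -> "3.8.4 Step 10:"
sort_steps() { to_keys | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n | from_keys; }

# Show the intermediate form that sort actually sees.
printf '%s\n' '3.8.4 Step 100:' '3.10.2 Step 10:' | to_keys
```

With a regular file, the functions read standard input directly, for example `sort_steps < your-file.txt`, so no `cat` is needed.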

danfuzz
  • I was hoping for a neater solution using only sort, but I guess this would work too. +1; I'll see if anyone else knows a different way – jdex Jul 12 '12 at 03:30

There's a fascinating article on re-engineering the Unix sort ('Theory and Practice in the Construction of a Working Sort Routine', J P Linderman, AT&T Bell Labs Tech Journal, Oct 1984) which is not, unfortunately, available on the internet, AFAICT (I looked a year or so ago and did not find it; I looked again just now, and can find references to it, but not the article itself). Amongst other things, the article demonstrated that for Unix sort, the comparison time far outweighs the cost of moving data (not very surprising when you consider that each comparison has to locate the key fields in each row, whereas moving 'data' is simply a question of switching pointers around). One upshot was that the article recommended doing what danfuzz suggests: mapping keys to make comparisons easy. It showed that even a simple scripted solution could save time compared with making sort work really hard.

So, you could think in terms of using a character that's unlikely to appear in the data file naturally (such as Control-A) as the key field separator.

sed 's/^\([^.]*\)[.]\([^.]*\)[.]\([^ ]*\) Step \([0-9]*\):.*/\1^A\2^A\3^A\4^A&/' file |
sort -t'^A' -k1,1n -k2,2n -k3,3n -k4,4n |
sed 's/^.*^A//'

The first command is the hard one. It identifies the 4 numeric fields, and outputs them separated by the chosen character (written ^A above, typed as Control-A), and then outputs a copy of the original line. The sort then works on the first four fields numerically, and the final sed command strips off the front of each line up to and including the last Control-A, giving you the original line back again.
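
One possible way to avoid typing a literal Control-A is to generate it with `printf` and splice it into the commands. This is only a sketch of the same decorate, sort, undecorate idea; the variable `sep` and the function name `sort_steps_ctrl_a` are hypothetical, and it assumes Control-A never occurs in the data.

```shell
# Generate Control-A (octal 001) instead of typing it; "sep" is a made-up variable name.
sep=$(printf '\001')

# Hypothetical wrapper: reads lines on stdin, writes them sorted on stdout.
sort_steps_ctrl_a() {
    # Decorate: prefix each line with its four numeric keys, each followed by Control-A.
    sed "s/^\([^.]*\)[.]\([^.]*\)[.]\([^ ]*\) Step \([0-9]*\):.*/\1${sep}\2${sep}\3${sep}\4${sep}&/" |
    # Sort numerically on the four decorated keys.
    sort -t"$sep" -k1,1n -k2,2n -k3,3n -k4,4n |
    # Undecorate: drop everything up to and including the last Control-A.
    sed "s/^.*${sep}//"
}

# Usage: sort_steps_ctrl_a < file
```

Because the original line is carried along untouched after the keys, text after the colon (even text containing "Step") comes back exactly as it went in.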

Community
Jonathan Leffler
  • @FrankComputer: related, definitely - and it cites Linderman. But not the same thing. See also [Choosing a Pivot for a QuickSort](http://stackoverflow.com/questions/164163/choosing-a-pivot-for-quicksort/164183#164183), where I mention the Bentley paper you ask about, and some others. – Jonathan Leffler Jul 12 '12 at 23:40
  • Was able to get a brief preview here: http://books.google.com/books?id=Hy62AAAAIAAJ&q=Linderman#search_anchor – Joe R. Jul 12 '12 at 23:54
  • For those with academic research access (or *cough* Sci-Hub), the DOI is 10.1002/j.1538-7305.1984.tb00067.x – jkmartindale Jul 31 '18 at 15:11

This might work for you:

 sort -k3,3n file | sort -nst. -k1,1 -k2,2 -k3,3

or a very iffy:

 sort -nt. -k1,1 -k2,2 -k3,3 -k3.7 file

The first uses two sorts:

  1. sort -k3,3n sorts by steps
  2. sort -nst. -k1,1 -k2,2 -k3,3 sorts by major numbers but keeps the step order

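The second works only while the 3rd major number stays a single digit (below 10): -k3.7 starts the key at a fixed character offset within the third field, and a two-digit number pushes the step number past that offset.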

or perhaps:

sed 's/ /./2' file | sort -nt. -k1,1 -k2,2 -k3,3 -k4,4 | sed 's/\./ /3'
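
A quick way to check how the character-offset key (`-k3.7`) depends on the width of the third number is to feed it two tiny inputs. This is a sketch assuming a sort that supports character offsets in `-k` as GNU sort does; the wrapper name `sort_by_offset` is made up, and `LC_ALL=C` is added here only to make tie-breaking byte-wise and predictable.

```shell
# Hypothetical wrapper around the character-offset command from this answer.
sort_by_offset() { LC_ALL=C sort -nt. -k1,1 -k2,2 -k3,3 -k3.7; }

# One-digit third number: character 7 of field 3 ("4 Step 90:") is the blank
# just before the step number, so the last key is read as a number.
printf '%s\n' '3.8.4 Step 100:' '3.8.4 Step 90:' | sort_by_offset

# Two-digit third number: character 7 of field 3 ("10 Step 90:") is the "p" of
# "Step", so the last key is not numeric for either line and the lines tie.
printf '%s\n' '3.8.10 Step 100:' '3.8.10 Step 90:' | sort_by_offset
```

In the second case the tie falls back to a plain comparison of the whole line, so Step 100 is expected to come out ahead of Step 90.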
potong
  • I think the first one would work, but the version of sort I'm using on solaris 10 doesn't have the -s option. – jdex Jul 12 '12 at 23:34
  • @jdex sorry I guess `-s` is a GNU feature. The `sed` solution may help – potong Jul 13 '12 at 00:00

UPDATED:

This will generate the output you specified:

sed 's/Step /Step./' data|sort -t. -n -k1,1 -k2,2 -k3,3 -k4|sed 's/Step./Step /'

result:

3.8.4 Step 10:
3.8.4 Step 90:
3.8.4 Step 100:
3.8.10 Step 20:
3.9.1 Step 10:
3.9.1 Step 20:
3.10.2 Step 10:

The challenge with this sort is that the sorting fields are defined by both '.' (for the version numbers) and the default whitespace (for the Step numbers). You can't specify several/different field separators for the same sort command. Combining several sorts with different field separators did not yield the right output.

This solution works by replacing the blank space after the Step field temporarily with a '.' so that all sorting fields can be separated with the same character ('.'). After the sort is done, the '.' is replaced with a blank again.
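
To make the field-separator point concrete, here is a small sketch that prints the fields a '.'-separated split produces before and after the temporary substitution. It uses awk purely as a viewer; `show_fields` is a hypothetical helper, not part of the solution.

```shell
# Hypothetical helper: split each input line on the separator given as $1
# and print the fields as "1=[...] 2=[...] ...".
show_fields() {
    awk -F"$1" '{
        for (i = 1; i <= NF; i++) printf "%s%d=[%s]", (i > 1 ? " " : ""), i, $i
        print ""
    }'
}

# Before the substitution the step number is buried inside field 3.
printf '%s\n' '3.8.4 Step 10:' | show_fields '.'

# After it, the step number starts its own field 4, which -k4 can sort on.
printf '%s\n' '3.8.4 Step 10:' | sed 's/Step /Step./' | show_fields '.'
```

The first call shows three fields, with `4 Step 10:` as the third; the second shows `4 Step` and `10:` as separate third and fourth fields.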

Levon
  • @jdex I found a solution I believe, please see if this is an acceptable answer for your problem. – Levon Jul 12 '12 at 11:50
  • +1 up, I really wanted to avoid modifying the data because what I provided isn't the full dataset. There is a string description of each step (sometimes also containing "Step"). It's starting to look like there's no other way though – jdex Jul 13 '12 at 00:05
  • @jdex This really puzzled me (literally), so after I said it couldn't be done I thought about it some more, and more. I still don't think `sort` alone will do, just because of the different field separators involved, but it was a challenging problem to solve. You could always post a more representative set of data to make sure you get a working solution for all of your cases. I am sure some of the solutions on this page could be tweaked for that. – Levon Jul 13 '12 at 00:09