There's a fascinating article on re-engineering the Unix sort ('Theory and Practice in the Construction of a Working Sort Routine', J P Linderman, AT&T Bell Labs Tech Journal, Oct 1984) which is not, unfortunately, available on the internet, AFAICT (I looked a year or so ago and did not find it; I looked again just now and can find references to it, but not the article itself). Amongst other things, the article demonstrated that for Unix sort, the comparison time far outweighs the cost of moving data (not very surprising when you consider that the comparison has to find and compare fields that are determined per row, whereas moving 'data' is simply a question of switching pointers around). One upshot was that they recommended doing what danfuzz suggests: mapping keys to make comparisons easy. They showed that even a simple scripted solution could save time compared with making sort work really hard.
So, you could think in terms of using a character that's unlikely to appear in the data file naturally (such as Control-A) as the key field separator:
sed 's/^\([^.]*\)[.]\([^.]*\)[.]\([^ ]*\) Step \([0-9]*\):.*/\1^A\2^A\3^A\4^A&/' file |
sort -t'^A' -k1,1n -k2,2n -k3,3n -k4,4n |
sed 's/^.*^A//'
The first command is the hard one. It identifies the 4 numeric fields and outputs them separated by the chosen character (written ^A above, typed as Control-A), followed by a copy of the original line. The sort then works on the first four fields numerically, and the final sed command strips off the front of each line up to and including the last Control-A, giving you the original line back again.
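Since a literal Control-A is awkward to type and does not survive copy-and-paste well, here is a minimal sketch of the same pipeline that generates the character with printf instead. The function name sort_steps and the variable ctrl_a are hypothetical, not part of the original commands, and the sketch assumes input lines shaped like "2.9.1 Step 12: some text", which is what the sed pattern above expects.

```shell
#!/bin/sh
# Sketch only: sort_steps and ctrl_a are illustrative names.
# Reads lines on stdin, writes them to stdout sorted numerically by
# the three dotted numbers and then by the step number.
sort_steps() {
    # Generate a literal Control-A (octal 001) rather than typing it.
    ctrl_a=$(printf '\001')

    # 1. Decorate: prefix each line with its four numeric keys, each
    #    followed by Control-A, then the untouched original line (&).
    sed "s/^\([^.]*\)[.]\([^.]*\)[.]\([^ ]*\) Step \([0-9]*\):.*/\1${ctrl_a}\2${ctrl_a}\3${ctrl_a}\4${ctrl_a}&/" |
    # 2. Sort numerically on the four Control-A separated key fields.
    sort -t"$ctrl_a" -k1,1n -k2,2n -k3,3n -k4,4n |
    # 3. Undecorate: drop everything up to and including the last Control-A.
    sed "s/^.*${ctrl_a}//"
}
```

It would be used as sort_steps < file. Lines that do not match the expected shape pass through the first sed undecorated, so this only behaves as described when every line follows the assumed format and none of them already contains a Control-A.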