One of the main time consumers with Unix sort is finding the keys; that is anything but the simple comparison operation that you tend to see in simple sorting exercises. Even finding one of the keys is quite a slow process.
So, one way to speed things up is to make it easier for the sort to find the keys: preprocess the file so that the 5 keys you mention are at the front of each line, then sort the data (maybe using the splitting and merging operations suggested by others), and then remove the keys.
For example, if you have colon-separated fields, and the sort keys are 1, 3, 7, 10, 12, and they're all regular alphabetic sorts, then you might use:
awk -F: '{printf "%s:%s:%s:%s:%s:%s\n", $1, $3, $7, $10, $12, $0; }' monster-file |
sort -t: -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 |
sed 's/^[^:]*:[^:]*:[^:]*:[^:]*:[^:]*://'
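To see the decorate, sort, strip sequence working end to end, here is a minimal sketch on made-up data. The function name sort_by_prefixed_keys, the three sample records, and the choice of fields 3 and 1 as keys are all hypothetical; LC_ALL=C is my assumption to make the ordering reproducible, not something the original pipeline requires.

```shell
# Hypothetical helper: reads colon-separated records on stdin and sorts
# them by field 3, then field 1, using the prefix-the-keys technique.
sort_by_prefixed_keys() {
    # Decorate: copy the two key fields to the front of each record.
    awk -F: '{ printf "%s:%s:%s\n", $3, $1, $0 }' |
    # Sort on the two leading fields only (byte order, for repeatability).
    LC_ALL=C sort -t: -k1,1 -k2,2 |
    # Undecorate: strip the two prefixed key fields again.
    sed 's/^[^:]*:[^:]*://'
}

# Made-up sample records.
printf '%s\n' 'bob:x:200:staff' 'alice:x:100:admin' 'carol:x:100:staff' |
    sort_by_prefixed_keys
```

The records come back in their original form, ordered by the third field and then the first; only the intermediate stream carries the extra key prefix.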
You might even be able to do without the five -k options and simply run sort -t: on its own. In fact, you can probably arrange to use a different delimiter altogether (maybe a control character such as ^A) to simplify the code. You separate the key fields from the main record with this alternative character:
awk -F: '{printf "%s:%s:%s:%s:%s^A%s\n", $1, $3, $7, $10, $12, $0; }' monster-file |
sort -t$'\001' |
sed 's/^[^^A]*^A//'
This uses a bash-ism (ANSI-C Quoting) in the $'\001' argument to sort; the ^A items in the awk and sed scripts are what you get from typing Control-A, though you could also arrange for the bash notation to provide the character too:
awk -F: '{printf "%s:%s:%s:%s:%s'$'\001''%s\n", $1, $3, $7, $10, $12, $0; }' monster-file |
sort -t$'\001' |
sed 's/^[^'$'\001'']*'$'\001''//'
(Warning: untested scripting.)
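If you would rather avoid both the bash-specific quoting and typing a literal Control-A, one option is to generate the byte with printf and pass it around in a shell variable. The following is only a sketch under assumptions: the names SOH and sort_with_soh_prefix and the two-key sample data are hypothetical, and LC_ALL=C is my choice to keep the whole-line comparison in plain byte order.

```shell
# Control-A produced portably; command substitution keeps the byte intact.
SOH=$(printf '\001')

# Hypothetical helper: sorts stdin by field 3, then field 1, separating
# the key prefix from the original record with Control-A.
sort_with_soh_prefix() {
    # Decorate: keys first, then Control-A, then the untouched record.
    awk -F: -v sep="$SOH" '{ printf "%s:%s%s%s\n", $3, $1, sep, $0 }' |
    # No -k options: the keys lead the line, so a whole-line sort suffices here.
    LC_ALL=C sort |
    # Undecorate: drop everything up to and including the first Control-A.
    sed "s/^[^$SOH]*$SOH//"
}

# Made-up sample records.
printf '%s\n' 'bob:x:200:staff' 'alice:x:100:admin' 'carol:x:100:staff' |
    sort_with_soh_prefix
```

Because the variable is expanded inside ordinary double quotes, the same sed line should work in any POSIX shell, provided the data itself never contains a Control-A.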
There's a fascinating article on re-engineering the Unix sort ('Theory and Practice in the Construction of a Working Sort Routine', J P Linderman, AT&T Bell Labs Tech Journal, Oct 1984) that describes how /bin/sort was improved. It's not readily available; I've not found it on the Internet despite several attempts to search for it. Even after all the improvements, one of its recommendations for complex sorts was exactly along these lines.