4

I'm trying to sort a string of letters and numbers alphanumerically in an "intuitive"/natural way using the unix sort command, but cannot get it to sort properly. I have this file:

$ cat ~/headers 
@42EBKAAXX090828:6:100:1699:328/2
@42EBKAAXX090828:6:10:1077:1883/2
@42EBKAAXX090828:6:102:785:808/2

I'd like to sort it alphanumerically, where intuitively @42EBKAAXX090828:6:10:... is first (since 10 is smaller than 100 and 102), second is @42EBKAAXX090828:6:100... and third is @42EBKAAXX090828:6:102:204:1871/2.

I know that suggest sorting on a particular position within the line, but the position of the : here could vary and so this would not be a general and workable solution here.

I tried:

sort --stable -k1,1 ~/headers > foo

with various combinations of -n and -u parameters but it does not give the correct ordering.

How can this be done efficiently, either from bash using sort or from Python? I'd like to apply this to files that are round 4-5 GB in size, so containing millions of lines.

Thanks!

Lesmana
  • 25,663
  • 9
  • 82
  • 87
  • 1
    FYI, this is usually called "Natural Sorting". – yak Dec 06 '11 at 04:42
  • Not sure about performance, but here's an implementation of natural sort in python: http://stackoverflow.com/q/4836710/331473 – Adam Wagner Dec 06 '11 at 04:45
  • how would you handle `@42EBKAAXX09082*7*:6:100:1699:328/2` and `@42EBKAAXX09082*8*:6:100:1699:328/2` (`*`s for emphasis)? are they sorted the same? (i.e. only the 3rd field is relevant) then @JonathanM's answer is best. Otherwise have a look at mine – tobyodavies Dec 06 '11 at 04:56

3 Answers3

11

the -V option appears to do what you want - natural sorting. Intended for version numbers apparently (hence the letter chosen)

sort -V ~/headers

outputs

@42EBKAAXX090828:6:10:1077:1883/2
@42EBKAAXX090828:6:100:1699:328/2
@42EBKAAXX090828:6:102:785:808/2
tobyodavies
  • 27,347
  • 5
  • 42
  • 57
4

It is sorting it alphabetically as it is in your example. The reason 10: is coming after 100 and 102 is that 10: is after them, since the colon : is after the 9 character in the ASCII chart.

If you're wanting to sort on the third field delimited by a colon, try this:

sort -t':' -k3 ~/headers > foo
Jonathan M
  • 17,145
  • 9
  • 58
  • 91
  • Good answer if OP only ever wants to sort on that field – tobyodavies Dec 06 '11 at 04:54
  • Probably better to use `-k3n` or `-k3,4n` so that `9` sorts before `10`. There's room to think that the OP might want '@43ZQRY101112:6:19:221:134/3' to sort after the rows shown rather than in second place, so the sort probably needs to be on more keys than just the third. It would be interesting to know whether the data '@6NBGD010101:9:99:999:111/3' or '@213QED081231:16:91:23:2/0' could appear, and where these should appear relative to the rows starting '@42E'. The problem is, as yet, under-specified because we don't have a full picture of the variability of the incoming data. – Jonathan Leffler Dec 06 '11 at 05:08
  • @JonathanLeffler, quite possible. Good comment. Thanks. – Jonathan M Dec 06 '11 at 05:11
  • @JonathanLeffler: thanks for pointing this out. To be clear, the order is to treat the ':' as if they were orders of magnitude, from high to low, in decimal numbers. So "6:500" precedes "7:10" and "2:200" precedes "6:500". This is why I don't think the solution of sorting on particular colon fields number will work. Is there an alternative to that? thanks. –  Dec 06 '11 at 05:21
  • the truest answer to this is that I'd like to be identical to how the program "samtools" with "samtools sort -n" option sorts, but unfortunately I cannot find a precise description of what sorting procedure it uses anywhere... does any one know?? –  Dec 06 '11 at 05:23
  • 1
    @user248237: Do you care about the relative order of the fields before the first colon? Are they the most or least important sort fields? And is a 'straight codeset order' sort OK for the first field? If you only need the 4 fields after the first colon up to the slash sorted in numerical order, then `sort -t: -k2,3n -k3,4n -k4,5n -k5,6n` should do the trick. If you need to treat the first field specially, then it gets more complex. – Jonathan Leffler Dec 06 '11 at 05:36
  • There's a beautiful paper 'Theory and Practice in the Construction of a Working Sort Routine' by J P Linderman. At the end, it shows that the best (or, at least, a good) way to improve the performance of sort on complex keys is to write transformation code that puts the keys at the front of each line in an easily sortable format, then feeds the data to the main sort, and then strips the sorting key off the output lines. This reduces the overhead of interpreting each line, thus making the sort faster. Sadly, the paper is hard to find on the internet; Google finds references but not the paper. – Jonathan Leffler Dec 06 '11 at 05:42
  • @user248237 if you want the `:` to act as and order of magnitude separator, then see my answer. – tobyodavies Dec 06 '11 at 05:59
0

This is usually called Natural Sorting. Here's one way that works for your example data set.

import re

def natural_sorted(iterable, reverse=False):
    """Return a list sorted the way that humans expect."""
    def convert(text):
        return int(text) if text.isdigit() else text
    def natural(item):
        return map(convert, re.split('([0-9]+)', item))
    return sorted(iterable, key=natural, reverse=reverse)

I found this here and improved a bit.

yak
  • 8,851
  • 2
  • 29
  • 23
  • will this scale up relative to Unix sort? Can it work on millions of lines long files? –  Dec 06 '11 at 05:18