
I have a very large data set that I want to sort on

  • column 1: numerically and then alphanumerically
  • then on column 2: numerically.

So, my final output would be something like this:

1    11  
1    13
1    15
2    3
2    5
chr2   6
chr2   15
chr15   3
chr15   9

I am using sort on Unix, but with every variant I try, chr2 ends up either at the top or at the bottom. Here are some of the commands I tried, none of which gives the desired output:

sort -V -k1,1n -k2n final_merged.txt > merged-sort.txt
sort -k1,1n -k2n final_merged.txt > merged-sort.txt 
sort -k1,1h -k2n final_merged.txt > merged-sort.txt
sort -k1,1 -k2n final_merged.txt > merged-sort.txt

Post edit: Is there any way to fix this without exhausting memory, using either

  • sort or other unix utilities
  • python

Thanks,

everestial007
  • Is the prefix in column 1 always `chr` (or at least the same number of characters)? – chepner May 31 '18 at 23:36
  • It could be different. This sorting has been giving me trouble all day. I have read several sort tutorials but cannot fix this. – everestial007 May 31 '18 at 23:40
  • @chepner: is there any way to fix this? – everestial007 Jun 01 '18 at 00:01
  • You want a numeric sort, but 'chr2' is not a number. You need a preprocessing step that splits the first column into two columns: one holding 'chr' (or an empty string '' when the value is just a number) and one holding the number. sed or awk can probably do this. – Evan Benn Jun 01 '18 at 00:11

3 Answers


Try:

sort -k1,2 -V final_merged.txt

Running this using your sample data gives me:

1    11
1    13
1    15
2    3
2    5
chr2   6
chr2   15
chr15   3
chr15   9
pgngp

You want a numeric sort, but 'chr2' is not a number. You need a preprocessing step that splits the first column into two columns: the text part and the number part.

gawk 'match($1, /([^0-9]*)([0-9]*)/, a) {print a[1], a[2], $2}' /tmp/abc | sort -t ' ' -k1,1 -k2,2n -k3,3n

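This uses gawk to split column 1 with a regex into a non-numeric part and a numeric part, then prints both followed by column 2, with the three output fields separated by single spaces.

sort then treats a single space as the field separator (-t ' '), so an empty text part still counts as field 1.

To recombine the columns afterwards, pipe the result through:

gawk -F'[ ]' '{print $1 $2, $3}'

The explicit single-space separator matters here: with the default field splitting, gawk discards the leading space on lines whose text part is empty, and the fields shift.

You may need to modify these commands to preserve whatever whitespace you need.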
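
To make the split concrete, here is a small Python sketch of the same decomposition the regex performs. The helper name `split_label` is hypothetical, and the sketch assumes column 1 is an optional non-digit prefix followed by an optional number; treating a missing number as 0 is a choice made here to mirror how sort -n handles an empty field.

```python
import re

# Optional non-digit prefix, then an optional run of digits
# (the same shape as the gawk regex above).
_PREFIX_NUM = re.compile(r'([^0-9]*)([0-9]*)')

def split_label(label):
    """Split 'chr15' into ('chr', 15); a bare '2' becomes ('', 2)."""
    prefix, digits = _PREFIX_NUM.match(label).groups()
    # Assumption: a label without digits (e.g. 'chrX') gets 0 as its number.
    return prefix, int(digits) if digits else 0

labels = ["chr15", "2", "chr2", "1"]
ordered = sorted(labels, key=split_label)
print(ordered)  # ['1', '2', 'chr2', 'chr15']
```

Sorting on the (text, number) pair puts the empty prefix first and compares the numbers as integers, which is why chr2 lands before chr15.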

Evan Benn

A Python solution:

First, define a natural-sort key:

import re

_nsre = re.compile('([0-9]+)')
def natural_sort_key(s):
    return [int(text) if text.isdigit() else text.lower()
            for text in re.split(_nsre, s)]

Then sort as you wanted:

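sorted_data = sorted(data, key=lambda item: (natural_sort_key(str(item[0])), int(item[1])))

This sorts primarily on item[0] using the natural-sort key, then on item[1] numerically (int() keeps the comparison numeric even when the values were read from a file as strings).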
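
As a quick check, here is a self-contained sketch that applies this kind of key to the sample rows from the question. The `row_key` helper and the in-memory `lines` list are illustrative assumptions; each line is assumed to hold at least two whitespace-separated columns.

```python
import re

_nsre = re.compile('([0-9]+)')

def natural_sort_key(s):
    # Digit runs become ints, everything else is compared case-insensitively.
    return [int(text) if text.isdigit() else text.lower()
            for text in _nsre.split(s)]

def row_key(line):
    # Hypothetical helper: natural sort on column 1, numeric on column 2.
    col1, col2 = line.split()[:2]
    return natural_sort_key(col1), int(col2)

lines = ["chr15 9", "2 5", "chr2 15", "1 13", "chr2 6",
         "1 11", "chr15 3", "2 3", "1 15"]
sorted_lines = sorted(lines, key=row_key)
for line in sorted_lines:
    print(line)
```

This prints the rows in the order asked for in the question: the plain numbers first, then chr2, then chr15, each group ordered by column 2.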

Stephen C
  • The OP states that the collection will not fit in memory; this Python code even requires the data to be in memory twice. – Evan Benn Jun 01 '18 at 01:37
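
On the memory concern, one possible approach in Python is an external merge sort: sort fixed-size chunks in memory, write each sorted chunk to a temporary file, then merge the files lazily with heapq.merge. The following is only a sketch under assumptions: `external_sort`, `row_key` and `chunk_size` are names made up for illustration, every line is assumed to have at least two whitespace-separated columns, and with very many chunks the number of open temporary files could hit the operating system's limit.

```python
import heapq
import itertools
import re
import tempfile

_nsre = re.compile('([0-9]+)')

def row_key(line):
    # Natural sort on column 1, numeric sort on column 2.
    col1, col2 = line.split()[:2]
    natural = [int(t) if t.isdigit() else t.lower() for t in _nsre.split(col1)]
    return natural, int(col2)

def external_sort(lines, chunk_size=100_000):
    """Yield lines in sorted order, sorting at most chunk_size lines at a time."""
    runs = []
    it = iter(lines)
    while True:
        # Sort one chunk in memory and spill it to a temporary file.
        chunk = sorted(itertools.islice(it, chunk_size), key=row_key)
        if not chunk:
            break
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(l if l.endswith("\n") else l + "\n" for l in chunk)
        run.seek(0)
        runs.append(run)
    try:
        # heapq.merge reads the sorted runs lazily, one line per run at a time.
        yield from heapq.merge(*runs, key=row_key)
    finally:
        for run in runs:
            run.close()

sample = ["chr15 9\n", "2 5\n", "chr2 15\n", "1 13\n", "chr2 6\n",
          "1 11\n", "chr15 3\n", "2 3\n", "1 15\n"]
result = [line.strip() for line in external_sort(sample, chunk_size=3)]
print(result)
```

For a real file, the generator can be fed an open file object and its output written with writelines, for example reading final_merged.txt and writing merged-sort.txt, so that only one chunk plus one line per run is held in memory during the merge.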