Is it possible to sort a huge text file using Linux sort command by a number at the end of each line?

Question

I am trying to sort a text file where the lines are in the following format:

! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6

and want to sort numerically descending by the number at the end (i.e 6 in this example). The lines do not have a predicable number of columns using space as a delimiter, but using ||| as a delimiter there are always 5 columns, and the final column always has 3 space delimited numbers, the last of which to sort by. The text file is around 15gb and I did have a perl script I wrote to do it but it only worked on my old laptop which had 32gb of RAM because perl loads the whole file at once. Now I am stuck with 8gb RAM and it just churns the swap file for days. I have heard that the standard linux sort command handles huge files more gracefully but I can't find a way to make it use the number at the end.

[This](http://stackoverflow.com/questions/3222810/sorting-on-the-last-field-of-a-line) question provides several solutions including one in perl. — devnull, Sep 20 '13 at 12:52

fedorqui · Accepted Answer · 2013-09-20T12:52:00.853

Maybe it is a bit tricky, but this mix of commands can make it:

awk '$1=$NF" "$1' file | sort -n | cut -d' ' -f2-

The main idea is that we print the file appending the last value in the front of the line, then we sort and we finally remove that value from the output.

awk '$1=$NF" "$1' file As the parameter you want to sort by is the last one in the file, let's print it also in the first field.
sort -n Then we pipe to sort -n, which sorts numerically.
cut -d' ' -f2- and we finally print out the value we temporally used.

Test

$ cat a
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
$ awk '$1=$NF" "$1' a | sort -n | cut -d' ' -f2-
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89

Showing each step:

$ awk '$1=$NF" "$1' a 
6 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
79 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
19 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
8 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
89 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
$ awk '$1=$NF" "$1' a | sort -n
6 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
8 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
19 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
79 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
89 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
$ awk '$1=$NF" "$1' a | sort -n | cut -d' ' -f2-
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89

The `--batch-size=N` option can also help for gigantic inputs. This loads at most N records into memory, and uses tempfiles for merging. — amon, Sep 20 '13 at 12:50
If the intent is to reduce memory load, will not this require loading the whole file into memory? — TLP, Sep 20 '13 at 13:36
@TPL yes you are right. I guess the mixure of "big file" and "strange data" does not allow room for a much better solution. Hope somebody will come with something better than mine, though! — fedorqui, Sep 20 '13 at 13:45
Looks like the solution thanks! Only one thing, I ran this and it took ~1.5 hours with --parallel=6 set which is absolutely fine for me, speed is not really important and at least its not taking days now! The sort order was ascending though and I need the highest value first - what is the parameter to the sort program? Is it -r to get descending? In terms of memory load, it didn't use more than a few hundred mb, but I need to get is descending to confirm the result. — Dave, Sep 20 '13 at 22:13
Nice to read it worked! Yes, `sort -rn` will reverse the result, @Dave — fedorqui, Sep 21 '13 at 04:18

score 1 · Answer 2 · answered Sep 20 '13 at 12:56

1

It seems that you want to order the file according to the last number, right?

So you can duplicate the last field at the start of the line with awk

awk -F, '{ print $NF, $0 }' prova

then sort the file with

sort -n -k1

and finally remove the fake first field:

sed 's/^[0-9][0-9]* //'

Here is the script:

awk -F, '{ print $NF, $0 }' prova | sort -n -k1 | sed 's/^[0-9][0-9]* //'

answered Sep 20 '13 at 12:56

Luca Mastrostefano

3,201
2
27
34

It seems like this would also require loading the entirety of the file into memory at once. – AKHolland Sep 20 '13 at 14:25
No, it doesn't. Try to read [here](http://stackoverflow.com/questions/930044/how-could-the-unix-sort-command-sort-a-very-large-file) – Luca Mastrostefano Sep 22 '13 at 22:27

TLP · Answer 3 · 2013-09-20T14:48:58.417

0

Since the problem is RAM, perhaps you can reduce the memory required by using Tie::File. It will allow you to refer to a line by its index in an array. You can get the numbers to sort by and use a Schwartzian transform to get a sorted list of indexes, and then simply reprint the file at the end.

use strict;
use warnings;
use Tie::File;

my $file = shift;                           # your filename argument
tie my @lines, 'Tie::File', $file or die $!;
my @list = map $_->[0],                     # restore line number
           sort { $b->[1] <=> $a->[1] }     # sort on captured number
           map { [ $_, $lines[$_] =~ /(\d+)$/ ] } 0 .. $#lines;
           # store an array ref [ ... ] containing line number and number to 
           # sort by
@lines = @lines[@list];

The last operation will save the file in the sorted order. Note that this is a permanent change, so make backups. It is also an expensive operation, probably, and Tie::File has had some performance issues. Another way to do it, that is probably less expensive is to simply iterate over the list of numbers and printing line by line to a new file:

open my $fh, ">", "output.csv" or die $!;
for my $num (@list) {
    print $fh $lines[$num], $/;
}

This printing directly to a file circumvents any shell caching required by redirecting output

edited Sep 20 '13 at 14:48

answered Sep 20 '13 at 13:35

TLP

66,756
10
92
149

how about array of `[line_number, file_seek_position, sort_num]`? downside is many file seeks when writing sorted file. – mpapec Sep 20 '13 at 13:59
I don't think it is viable to store file positions. You would have to store line length and do a lot of math which is already handled in normal line by line mode. – TLP Sep 20 '13 at 14:40
wander how `Tie::File` handles this. – mpapec Sep 20 '13 at 14:41
@mpapec - Not too well at all: [Re: Modify values of tied, split lines in a file](http://perlmonks.org/index.pl?node_id=1000412) – Kenosis Sep 20 '13 at 17:43
@Kenosis There is only one write command here, though, the rest is read. I thought it was an interesting option for the OP to try out. – TLP Sep 20 '13 at 19:11

score 0 · Answer 4 · answered Sep 20 '13 at 14:16

Assuming I'm allowed to ruin the original file (make a copy otherwise), you can use sort on the last column by rolling through the file once and turning the last column into a predictable column number. I'm using the @ symbol as something that I assume will not be in your data. Anything can be substituted if that's a bad assumption.

sed -i 's/ /@/g; s/@\([^@]*\)$/ \1/;' in.txt
# the file now looks like "!@!@|||@whatever@||| 6"
sort --buffer-size=1G -nk 2 in.txt | sed 's/@/ /g' > sorted.txt

Is it possible to sort a huge text file using Linux sort command by a number at the end of each line?

4 Answers4

Test