
I have been using `sort | uniq -c | sort -n` for years, but today it fails because my input file is 10 GB and my `/tmp` is only 1 GB:

sort: write failed: /tmp/sortmIGbL: No space left on device

Therefore I am looking for an efficient alternative for everyday use:

  • `awk` could be used, but it has no sorted associative arrays

  • `perl` seems to be a good option, but the ten-year-old solution from perlmonks.org does not seem to work:

    no warnings;
    $^W=0;
    open my $in, $ARGV[0] or die "Couldn't open $ARGV[0]:$!";
    my ($buffer, %h) = ''; keys %h = 1024*500;
    while (sysread($in, $buffer, 16384, length $buffer)) {
        $h{$1}++ while $buffer =~ m[^(?:.+?\|){9}([^|]+)\|]mg;
        $buffer = substr($buffer, rindex($buffer, "\n"));
    }
    print scalar keys %h;
    

How can I get the same result as `sort | uniq -c | sort -nr | head` on very large files?

  • Since I use Linux/Cygwin/Solaris/*BSD/..., I am open to any idea (portable or not)
  • You are free to use the scripting language you want (awk/perl/...)

input example

a
BB
ccccc
dddddddd
a
BB
a

one of the possible outputs

    3 a
    2 BB
    1 dddddddd
    1 ccccc
oHo
  • Hi @alex. I think `sort -un` is an alternative for `sort | uniq`. I will give an example in my question to be more clear... – oHo Feb 04 '14 at 19:08
  • Get more disk storage. – mpapec Feb 04 '14 at 19:16
  • That Perl code... brrr. Why start with `no warnings`, then turn off warnings by setting the `$^W` variable, and then not use any code that produces warnings? – TLP Feb 04 '14 at 19:19
  • Or set `TMPDIR` to a bigger disk. – tripleee Feb 04 '14 at 19:19
  • Why do you want sorted associative arrays in Awk? Unless there are many unique strings, `awk | sort -n` should already solve your problem, by producing a (probably much) smaller data file to sort. – tripleee Feb 04 '14 at 19:22
  • GNU awk has sorted associative arrays, and you can specify in what way the arrays should be sorted. – Ed Morton Feb 04 '14 at 20:49
  • Hi @EdMorton. Please provide an answer using `awk` sorted associative array. Cheers :) – oHo Feb 05 '14 at 08:41
  • Done. No idea if it'll work efficiently enough for your large data set, just showing an awk sorted associative array as requested. – Ed Morton Feb 05 '14 at 14:13

3 Answers


The first sort in your chain of commands is the one using all the resources. Reduce the problem set by getting the unique lines first, then sorting:

perl -ne '
    $count{$_}++;
    END {
        print "$count{$_} $_" for sort {
            $count{$b} <=> $count{$a} || $b cmp $a
        } keys %count
    }
' input.txt

You have 66,000 unique lines of 7 bytes, so the memory taken up by the hash keys is going to be 66,000 * 56 bytes for each of those scalars = 3,696,000 bytes for the keys. That doesn't include the counts and the overhead of the hash, but there's no doubt this approach will easily do the trick.

ThisSuitIsBlackNot
  • Perfect :-) My large input contains 66000 different lines out of 1.5 million lines (7 bytes per line). Your `perl` command takes 40 seconds to complete (6 times faster than `TMPDIR=$HOME LANG=C sort | uniq -c | sort -nr`). Thank you :-) – oHo Feb 04 '14 at 20:34
  • @olibre, 1.5 million * 7 bytes is 10MB, not 10GB! Did you mean 1.5 billion lines? – ikegami Feb 04 '14 at 20:52
  • Yep @ikegami, you are right ;) 1.5 billion lines! Too many zeros for me ;p Cheers – oHo Feb 05 '14 at 08:38
  • Any idea how to sort the matches with the same count number, but in ASCENDING order? – thebugfinder May 09 '14 at 16:31
  • I figured it out: swap `$a` and `$b` in one of the terms: `$count{$b} <=> $count{$a} || $a cmp $b` – thebugfinder May 09 '14 at 16:41

Sorting is not a sequential operation; you cannot just read 10 records in, sort them, forward them, and then do the next 10 records. So if you want to sort 10 GB of data, you either:

  • need lots of memory, e.g. way more than 10 GB
  • need lots of disk space (at least 10 GB) or sort in-place, e.g. inside the file (this will work for fixed-size records, but will be a nightmare for variable-sized records)
  • need a smarter approach to your problem (e.g. if the record size is 1MB but only 10 bytes of these are relevant for sorting you can be faster and use less memory with a smart algorithm)

BTW, did you try setting `TMPDIR` so that `sort` does not use `/tmp` but `/var/tmp` or any other directory with more disk space? Or maybe your `sort` has a `-T` option to specify the temp directory.
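For example, a minimal sketch, assuming `/var/tmp` sits on a filesystem with enough free space and that the big file is called `input.txt` (both are placeholders; any large, writable directory works):

    # put sort's temporary files on the bigger filesystem for the whole pipeline
    export TMPDIR=/var/tmp
    sort input.txt | uniq -c | sort -nr | head

    # or point just the big first sort at it with the -T option
    sort -T /var/tmp input.txt | uniq -c | sort -nr | head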

Steffen Ullrich
  • Hi Steffen. If there are one hundred different lines within the 10 GB input file, the efficient alternative for `sort | uniq -c` requires just the memory for those hundred different lines. Thank you for your advice to use `TMPDIR` (and the `-T` option). I have just changed the disk, and my `sort | uniq -c` took two minutes to complete. Thanks :-) However, I am still interested in an alternative consuming less memory... Cheers – oHo Feb 04 '14 at 19:32
  • @olibre, You need at least as much memory as it takes to store each unique line. A complicated program could take advantage of the fact that the data is already present in a file, but that would take a large amount of coding, and it would be quite slow. – ikegami Feb 04 '14 at 19:44
  • If each record is 1MB big but you need to sort only by the first 10 bytes you could just read only these 10 bytes for each record in memory together with the offset the record has in the file. Then you can quickly sort these small items in memory and finally open a new file and transfer the records from the old file in the sorted order (that's why you initially stored the offsets) into the new file. This would probably be much faster than a simple sort using the whole record and it would consume less memory too. – Steffen Ullrich Feb 04 '14 at 20:11
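A rough Perl sketch of that key-plus-offset idea from the last comment above (the file names, the fixed 1 MB record size, and the 10-byte key length are all assumptions for illustration, not details from the question):

    use strict;
    use warnings;

    my $RECSIZE = 1024 * 1024;   # assumed fixed record size: 1 MB
    my $KEYLEN  = 10;            # assumed sort key: first 10 bytes of each record
    my ($infile, $outfile) = ('records.dat', 'sorted.dat');   # hypothetical names

    open my $in, '<:raw', $infile or die "Couldn't open $infile: $!";

    # Pass 1: keep only (key, offset) pairs in memory - tiny compared to the records.
    my (@index, $key);
    my $offset = 0;
    while (read($in, $key, $KEYLEN) == $KEYLEN) {
        push @index, [ $key, $offset ];
        $offset += $RECSIZE;
        seek $in, $offset, 0;            # jump to the start of the next record
    }

    # Sort the small in-memory index by key.
    @index = sort { $a->[0] cmp $b->[0] } @index;

    # Pass 2: copy whole records into the new file in sorted order.
    open my $out, '>:raw', $outfile or die "Couldn't open $outfile: $!";
    for my $entry (@index) {
        seek $in, $entry->[1], 0;
        my $record;
        read($in, $record, $RECSIZE) == $RECSIZE
            or die "Short read at offset $entry->[1]";
        print {$out} $record;
    }
    close $out or die "Couldn't write $outfile: $!";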

With GNU awk for sorted associative arrays:

$ gawk '
    BEGIN{ PROCINFO["sorted_in"] = "@val_num_desc" }
    { a[$0]++ }
    END { for (i in a) print a[i], i }
' file
3 a
2 BB
1 dddddddd
1 ccccc

No idea if it'll work efficiently enough for your large data set; I'm just showing an awk sorted associative array as requested in the OP's comments below his question.

Ed Morton
  • Thanks Ed for teaching me about the gawk sorted table :-) But it seems it does not work on RedHat 6.5 (GNU Awk 3.1.7). Cheers – oHo Feb 10 '14 at 10:12
  • Correct; it's in gawk 4.something. You're on a very, very old version of gawk; you should get the latest one, as you're missing a lot of useful functionality. – Ed Morton Feb 10 '14 at 15:26