PROBLEM: I frequently need to see what the most-frequently-repeated "patterns" are within the last day of specific logs. For example, for a small subset of Tomcat logs:
GET /app1/public/pkg_e/v3/555413242345562/account/stats 401 954 5
GET /app1/public/pkg_e/v3/555412562561928/account/stats 200 954 97
GET /app1/secure/pkg_e/v3/555416251626403/ex/items/ 200 517 18
GET /app1/secure/pkg_e/v3/555412564516032/ex/cycle/items 200 32839 50
DELETE /app1/internal/pkg_e/v3/accounts/555411543532089/devices/bbbbbbbb-cccc-2000-dddd-43a8eabcdaa0 404 - 1
GET /app1/secure/pkg_e/v3/555412465246556/sessions 200 947 40
GET /app1/public/pkg_e/v3/555416264256223/account/stats 401 954 4
GET /app2/provisioning/v3/555412562561928/devices 200 1643 65
...
If I want to find the most-frequently-used URLs (along with method and return code), I do:
[root@srv112:~]$ N=6;cat test|awk '{print $1" "$2" ("$3")"}'\
|sed 's/[0-9a-f-]\+ (/%GUID% (/;s/\/[0-9]\{4,\}\//\/%USERNAME%\//'\
|sort|uniq -c|sort -rn|head -$N
4 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (401)
2 GET /app1/secure/pkg_e/v3/%USERNAME%/devices (200)
2 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (200)
2 DELETE /app1/internal/pkg_e/v3/accounts/%USERNAME%/devices/%GUID% (404)
1 POST /app2/servlet/handler (200)
1 POST /app1/servlet/handler (200)
If I want to find the most frequent usernames from the same file, I do:
[root@srv112:~]$ N=4;cat test|grep -Po '(?<=\/)[0-9]{4,}(?=\/)'\
|sort|uniq -c|sort -rn|head -$N
9 555412562561928
2 555411543532089
1 555417257243373
1 555416264256223
The above works fine on small data sets, but on larger inputs the performance (complexity) of sort|uniq -c|sort -rn|head -$N
becomes unbearable (we're talking ~100 servers, ~250 log files per server, ~1 million lines per log file).
ATTEMPT TO SOLVE: The |sort|uniq -c
part can easily be replaced with an awk one-liner, turning it into:
|awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}'|sort -rn|head -$N
but I failed to find a standard/simple and memory-efficient implementation of the quickselect algorithm (discussed here) to optimize the |sort -rn|head -$N
part (see the awk sketch after the example below).
I was looking for GNU binaries, RPMs, awk one-liners, or some easily-compilable ANSI C code that I could carry/spread across datacenters, to turn:
3 tasty oranges
225 magic balls
17 happy dolls
15 misty clouds
93 juicy melons
55 rusty ideas
...
into (given N=3):
225 magic balls
93 juicy melons
55 rusty ideas
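For reference, here is a minimal awk sketch of that selection idea - not a true quickselect, but the same effect for small N: it keeps only the N largest lines seen so far, evicting the current minimum whenever a bigger candidate arrives. Memory stays at O(N) and each input line costs at most one O(N) scan, which for single-digit N is effectively linear. This is an untested sketch written for the count-first format above, not a pre-tested standard tool:

N=3; awk -v N=$N '
{
    cnt = $1 + 0                       # leading count field
    if (kept < N) {                    # buffer not full yet: always keep
        kept++
        cnts[kept] = cnt; lines[kept] = $0
        minidx = min_kept()
    } else if (cnt > cnts[minidx]) {   # beats the smallest kept line: evict it
        cnts[minidx] = cnt; lines[minidx] = $0
        minidx = min_kept()
    }
}
function min_kept(   i, m) {           # index of the smallest kept count
    m = 1
    for (i = 2; i <= kept; i++) if (cnts[i] < cnts[m]) m = i
    return m
}
END {                                  # emit kept lines, largest count first
    for (out = 1; out <= kept; out++) {
        best = 0
        for (i = 1; i <= kept; i++)
            if (!done[i] && (best == 0 || cnts[i] > cnts[best])) best = i
        done[best] = 1
        print lines[best]
    }
}'

Piping the six sample lines above into it with N=3 should print exactly the three lines shown.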
I could probably grab sample Java code and port it to the above stdin format (by the way, I was surprised by the lack of a .quickselect(...)
in core Java), but the need to deploy a Java runtime everywhere isn't appealing.
I could maybe grab a sample (array-based) C snippet of it too, adapt it to the above stdin format, then test-and-fix-leaks etc. for a while. Or even implement it from scratch in awk (the sketch above is a start, but untested and O(N) per line rather than a real quickselect).
BUT(!) - this simple need is likely faced by more than 1% of people on a regular basis - there should be a standard (pre-tested) implementation of it out there, shouldn't there?
Hopefully... maybe I'm just using the wrong keywords to look it up...
OTHER OBSTACLES: I also faced a couple of issues working around large data sets:
- log files are located on NFS-mounted volumes of ~100 servers - so it made sense to parallelize and split the work into smaller chunks
- the above awk '{S[$0]+=1}... requires memory - I'm seeing it die whenever it eats up 16GB (despite having 48GB of free RAM and plenty of swap... maybe some Linux limit I overlooked)
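One idea for the memory ceiling (again just a sketch, assuming a 4-way split is enough to keep each table under the limit): shard the stream by a key-stable function before counting, so no single awk process ever holds the whole table. length($0) is a very crude "hash", but identical lines always have identical lengths, so each distinct pattern lands in exactly one shard and the per-shard counts can simply be concatenated (input.gz is a placeholder name):

zgrep -Po '[my pattern-grep regexp]' input.gz \
|awk '{ print > ("shard." (length($0) % 4)) }'  # key-stable 4-way split
# count each shard independently; concatenating the results is still exact
for f in shard.[0-3]; do
    awk '{S[$0]+=1} END{for(i in S) print S[i] "\t" i}' "$f"
done | sort -rn | head -$N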
My current solution (still in progress) is neither reliable nor optimal; it looks like:
find /logs/mount/srv*/tomcat/2013-09-24/ -type f -name "*_22:*"|\
# TODO: reorder 'find' output to round-robin through srv1 srv2 ...
# to help 'parallel' work on multiple servers at once
parallel -P20 $"zgrep -Po '[my pattern-grep regexp]' {}\
|awk '{S[\$0]+=1}
END{for(i in S)if(S[i]>4)print \"count: \"S[i]\"\\n\"i}'"|\
# I throw away patterns seen fewer than 5 times per log file,
# hoping those won't reach the top of the result list anyway - bogus,
# but it helps with the 16GB-mem problem in the 'awk' below
awk '{if("count:"==$1){C=$2}else{S[$0]+=C}}
END{for(i in S)if(S[i]>99)print S[i]"\t"i}'|\
# I also skip all patterns seen fewer than 100 times in total;
# the hope that these won't be on top of the list is quite reliable
sort -rn|head -$N
# the line above is the inefficient one I strive to address
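Assuming the selection sketch from earlier is saved to a file (say, topN.awk - a made-up name), that last line could presumably become:

awk -v N=$N -f topN.awk
# reads the same count-first lines and keeps only the N largest

since the aggregating awk above already emits count<TAB>pattern lines in exactly the format the sketch expects.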