PROBLEM: I frequently need to see what the most-frequently-repeated "patterns" are within the last day of specific logs. For example, for a small subset of Tomcat logs:
GET /app1/public/pkg_e/v3/555413242345562/account/stats 401 954 5
GET /app1/public/pkg_e/v3/555412562561928/account/stats 200 954 97
GET /app1/secure/pkg_e/v3/555416251626403/ex/items/ 200 517 18
GET /app1/secure/pkg_e/v3/555412564516032/ex/cycle/items 200 32839 50
DELETE /app1/internal/pkg_e/v3/accounts/555411543532089/devices/bbbbbbbb-cccc-2000-dddd-43a8eabcdaa0 404 - 1
GET /app1/secure/pkg_e/v3/555412465246556/sessions 200 947 40
GET /app1/public/pkg_e/v3/555416264256223/account/stats 401 954 4
GET /app2/provisioning/v3/555412562561928/devices 200 1643 65
...
If I want to find the most-frequently-used URLs (along with method and return code), I do:
[root@srv112:~]$ N=6;cat test|awk '{print $1" "$2" ("$3")"}'\
|sed 's/[0-9a-f-]\+ (/%GUID% (/;s/\/[0-9]\{4,\}\//\/%USERNAME%\//'\
|sort|uniq -c|sort -rn|head -$N
4 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (401)
2 GET /app1/secure/pkg_e/v3/%USERNAME%/devices (200)
2 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (200)
2 DELETE /app1/internal/pkg_e/v3/accounts/%USERNAME%/devices/%GUID% (404)
1 POST /app2/servlet/handler (200)
1 POST /app1/servlet/handler (200)
If I want to find the most frequent usernames from the same file, I do:
[root@srv112:~]$ N=4;cat test|grep -Po '(?<=\/)[0-9]{4,}(?=\/)'\
|sort|uniq -c|sort -rn|head -$N
9 555412562561928
2 555411543532089
1 555417257243373
1 555416264256223
The above works fine on small data sets, but on larger inputs the performance (complexity) of sort|uniq -c|sort -rn|head -$N
becomes unbearable (we're talking ~100 servers, ~250 log files per server, ~1 million lines per log file).
ATTEMPT TO SOLVE: The |sort|uniq -c
part can easily be replaced with an awk one-liner, turning it into:
|awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}'|sort -rn|head -$N
but I failed to find a standard/simple and memory-efficient implementation of the quickselect algorithm (discussed here) to optimize the |sort -rn|head -$N
part (see the awk sketch after the example below).
I was looking for GNU binaries, RPMs, awk one-liners, or some easily-compilable ANSI C code that I could carry/spread across datacenters, to turn:
3 tasty oranges
225 magic balls
17 happy dolls
15 misty clouds
93 juicy melons
55 rusty ideas
...
into (given N=3):
225 magic balls
93 juicy melons
55 rusty ideas
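For reference, here is a minimal awk sketch of that selection idea - not a true quickselect, but the same effect for small N: it keeps only the N largest lines seen so far, evicting the current minimum whenever a bigger candidate arrives. Memory stays at O(N) and each input line costs at most one O(N) scan, which for single-digit N is effectively linear. This is an untested sketch written for the count-first format above, not a pre-tested standard tool:

N=3; awk -v N=$N '
{
    cnt = $1 + 0                       # leading count field
    if (kept < N) {                    # buffer not full yet: always keep
        kept++
        cnts[kept] = cnt; lines[kept] = $0
        minidx = min_kept()
    } else if (cnt > cnts[minidx]) {   # beats the smallest kept line: evict it
        cnts[minidx] = cnt; lines[minidx] = $0
        minidx = min_kept()
    }
}
function min_kept(   i, m) {           # index of the smallest kept count
    m = 1
    for (i = 2; i <= kept; i++) if (cnts[i] < cnts[m]) m = i
    return m
}
END {                                  # emit kept lines, largest count first
    for (out = 1; out <= kept; out++) {
        best = 0
        for (i = 1; i <= kept; i++)
            if (!done[i] && (best == 0 || cnts[i] > cnts[best])) best = i
        done[best] = 1
        print lines[best]
    }
}'

Piping the six sample lines above into it with N=3 should print exactly the three lines shown.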
I could probably grab sample Java code and port it to the above stdin format (by the way, I was surprised by the lack of a .quickselect(...)
in core Java), but the need to deploy a Java runtime everywhere isn't appealing.
I could maybe grab a sample (array-based) C snippet of it too, adapt it to the above stdin format, then test-and-fix-leaks etc. for a while. Or even implement it from scratch in awk (the sketch above is a start, but untested and O(N) per line rather than a real quickselect).
BUT(!) - this simple need is likely faced by more than 1% of people on a regular basis - there should be a standard (pre-tested) implementation of it out there, shouldn't there?
Hopefully... maybe I'm just using the wrong keywords to look it up...
OTHER OBSTACLES: I also faced a couple of issues working around large data sets:
- log files are located on NFS-mounted volumes of ~100 servers - so it made sense to parallelize and split the work into smaller chunks
- the above awk '{S[$0]+=1}... requires memory - I'm seeing it die whenever it eats up 16GB (despite having 48GB of free RAM and plenty of swap... maybe some Linux limit I overlooked)
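One idea for the memory ceiling (again just a sketch, assuming a 4-way split is enough to keep each table under the limit): shard the stream by a key-stable function before counting, so no single awk process ever holds the whole table. length($0) is a very crude "hash", but identical lines always have identical lengths, so each distinct pattern lands in exactly one shard and the per-shard counts can simply be concatenated (input.gz is a placeholder name):

zgrep -Po '[my pattern-grep regexp]' input.gz \
|awk '{ print > ("shard." (length($0) % 4)) }'  # key-stable 4-way split
# count each shard independently; concatenating the results is still exact
for f in shard.[0-3]; do
    awk '{S[$0]+=1} END{for(i in S) print S[i] "\t" i}' "$f"
done | sort -rn | head -$N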
My current solution (still in progress) is neither reliable nor optimal; it looks like:
find /logs/mount/srv*/tomcat/2013-09-24/ -type f -name "*_22:*"|\
# TODO: reorder 'find' output to round-robin through srv1 srv2 ...
# to help 'parallel' work on multiple servers at once
parallel -P20 $"zgrep -Po '[my pattern-grep regexp]' {}\
|awk '{S[\$0]+=1}
END{for(i in S)if(S[i]>4)print \"count: \"S[i]\"\\n\"i}'"|\
# I throw away patterns seen fewer than 5 times per log file,
# hoping those won't reach the top of the result list anyway - bogus,
# but it helps with the 16GB-mem problem in the 'awk' below
awk '{if("count:"==$1){C=$2}else{S[$0]+=C}}
END{for(i in S)if(S[i]>99)print S[i]"\t"i}'|\
# I also skip all patterns seen fewer than 100 times in total;
# the hope that these won't be on top of the list is quite reliable
sort -rn|head -$N
# the line above is the inefficient one I strive to address
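Assuming the selection sketch from earlier is saved to a file (say, topN.awk - a made-up name), that last line could presumably become:

awk -v N=$N -f topN.awk
# reads the same count-first lines and keeps only the N largest

since the aggregating awk above already emits count<TAB>pattern lines in exactly the format the sketch expects.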