
I have large files of HTTP access logs and I'm trying to generate hourly counts for a specific query string. Obviously, the correct solution is to dump everything into splunk or graylog or something, but I can't set all that up at the moment for this one-time deal.

The quick-and-dirty is:

for hour in 0{0..9} {10..23}
do
  grep $QUERY $FILE | egrep -c "^\S* $hour:"
  # or, alternately
  # egrep -c "^\S* $hour:.*$QUERY" $FILE
  # not sure which one's better
done

But these files average 15-20M lines, and I really don't want to parse through each file 24 times. It would be far more efficient to parse the file and count each instance of $hour in one go. Is there any way to accomplish this?

Joe Fruchey
  • Edit your question to include some concise, testable sample input and expected output so we can help you. – Ed Morton Jun 20 '19 at 16:37

3 Answers


You can ask grep to output the matching part of each line with -o and then use uniq -c to count the results:

grep "$QUERY" "$FILE" | grep -o "^\S* [0-2][0-9]:" | sed 's/^\S* //' | uniq -c

The sed command keeps only the two-digit hour and the colon; you can drop the colon too with another sed expression if you want. Note that uniq -c only merges adjacent identical lines, so this relies on the log being in chronological order (as access logs normally are); otherwise insert a sort before the uniq.
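
For illustration, with a hypothetical log layout that matches the regex above (first field, a space, then an HH:MM:SS timestamp - this sample is invented, not taken from the question), uniq -c emits one line per hour that actually occurs, something like:

     42 09:
    117 10:
      8 11:

The counts here are made up; the point is the shape of the output.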

Caveats: this solution relies on GNU grep and GNU sed, and it produces no output at all, rather than a count of 0, for hours with no matching log entries. Kudos to @EdMorton for pointing these issues out in the comments, along with other issues that have since been fixed in this answer.

joanis
  • Ah! perfect. Thanks – Joe Fruchey Jun 20 '19 at 16:35
  • You should mention that that requires GNU grep and GNU sed, is using a deprecated call to egrep, will behave unexpectedly for various contents of QUERY or FILE, and will produce no output rather than `0` for times when $QUERY isn't present for that timestamp. – Ed Morton Jun 20 '19 at 16:51
  • Thanks for the info, Ed! GNU grep/sed is no issue for me, and the query in my case is fully alphanumeric, no regex, and no spaces in filenames, so I think I'm ok. The main thing I'm curious about is the deprecated call. Can you elaborate? – Joe Fruchey Jun 20 '19 at 17:12
  • Regexp chars aren't the problem, globbing chars and other strings the shell would interpret are. Just quote the variables to make SURE you're OK (see https://mywiki.wooledge.org/Quotes). And you're good with getting no output instead of 0 for the hours where QUERY isn't present? Wrt the deprecated call, see the man page - https://linux.die.net/man/1/egrep `Direct invocation as either egrep or fgrep is deprecated...` and just don't rely on egrep always being around, just use `grep -E` instead. – Ed Morton Jun 20 '19 at 17:13
  • Thanks @EdMorton for letting me know egrep is deprecated. Just replaced egrep with grep, since the -E extensions aren't used here anyway. Found this page on the deprecation of egrep: https://unix.stackexchange.com/questions/383448/why-is-direct-invocation-as-either-egrep-or-fgrep-deprecated – joanis Jun 20 '19 at 17:19

Assuming the timestamp appears with a space before the 2-digit hour and a colon after it:

gawk -v patt="$QUERY" '
    $0 ~ patt && match($0, / ([0-9][0-9]):/, m) {
        # send the matching line to a per-hour file named <hour>.<input file>
        print > (m[1] "." FILENAME)
    }
' "$FILE"

This will create up to 24 files, one for each hour that has a match.

Requires GNU awk for the 3-arg form of match()
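
To turn those per-hour files into hourly counts afterwards, one possible follow-up step (my assumption, not part of the answer, and it assumes $FILE is a bare filename in the current directory rather than a path) is:

# count the lines in each generated file, e.g. 00.access.log .. 23.access.log
# (access.log is just a placeholder name here)
wc -l [0-9][0-9]."$FILE"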

glenn jackman

This is probably what you really need, using GNU awk for the 3rd arg to match() and making assumptions about what your input might look like, what your QUERY variable might contain, and what the output should look like:

awk -v query="$QUERY" '
    # count lines where the query string appears after an " HH:" timestamp
    match($0, " ([0-9][0-9]):.*"query, a) { cnt[a[1]+0]++ }
    END {
        # print every hour 00-23, including hours with a count of 0
        for (hr=0; hr<=23; hr++) {
            printf "%02d = %d\n", hr, cnt[hr]
        }
    }
' "$FILE"

As an aside, don't use all upper case for non-exported shell variable names - see Correct Bash and shell script variable capitalization.

Ed Morton