Using AWK to find missing dates in a log

Question

I am trying to find missing dates in a log file. Essentially, I have 2 input files, an 'eventlist' and an 'eventlog' that look like this:

eventlist
EV01 Event number one
EV02 Event number two

eventlog
2014-09-14 EV01
2014-09-16 EV01
2014-09-20 EV01
2014-09-21 EV01
2014-09-22 EV01
2014-09-23 EV01
2014-09-24 EV01
2014-09-25 EV01
2014-09-14 EV02
2014-09-22 EV02
2014-09-23 EV02
2014-09-24 EV02
2014-09-25 EV02

For each event, I am trying to count the number of consecutive days (counting back from today) for which I have eventlog records. Based on the file above, I would like the output below:

6 Event number one
4 Event number two

So far I have the script below, but it returns the total count of occurrences for each event:

awk 'NR==FNR { a[$1]=$0; next }{print $1,a[$2]}' eventlist eventlog | awk '{print substr($0, index($0, $3))}' | awk -F, '!z[$1]++{ a[$1]=$0; } END {for (i in a) print z[i], a[i]}'

This currently returns:

8 Event number one
5 Event number two

Any ideas on how I can modify the above to show me the number of sequential days (up to today) instead of a total count?

score 2 · Accepted Answer · answered Sep 26 '14 at 02:43

I love challenges like this. It's late here: explanations upon request tomorrow.

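gawk '
    BEGIN { today = strftime("%F", systime()) }
    # Return the day before "date" (YYYY-MM-DD); noon keeps the 86400 s step clear of DST shifts.
    function day_before(date) {
        gsub(/-/, " ", date)
        return strftime("%F", mktime(date " 12 00 00") - 86400)
    }
    # First file (eventlist): map each event ID to its description.
    NR == FNR  { id = $1; $1 = ""; event[id] = $0; next }
    # New event ID in the reversed log: restart the expected day at today.
    $NF != eid { day = today; eid = $NF }
    # Skip records dated in the future.
    $1 > today { next }
    # Record matches the expected day: extend the streak, then expect the previous day.
    $1 == day  { count[eid]++; day = day_before(day) }
    END { for (id in count) print count[id], event[id] }
' eventlist <(tac eventlog)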

6  Event number one
4  Event number two

glenn jackman


Hi Glenn - Thanks for the answer. This works great. Would you mind walking through this so I can follow? I've accepted the solution, just want to make sure I understand how it works. – armohan Sep 26 '14 at 10:47
I wouldn't mind, but first are there any **specific** parts you don't understand? – glenn jackman Sep 26 '14 at 10:56
Specifically, the 2 lines after the function: `NR == FNR { id = $1; $1 = ""; event[id] = $0; next }` and `$NF != eid { day = today; eid = $NF }` – armohan Sep 26 '14 at 12:31
`NR==FNR` uses 2 awk variables: `NR` is the number of the current record counted across all input files, and `FNR` is the number of the current record within the current file. `NR==FNR` can only be true while reading the first file. This action reads the eventlist, storing the events in the `event` associative array. – glenn jackman Sep 26 '14 at 12:37
`NF` is the number of fields in the current record. `$` is an operator that returns the *value* of a field. `$1` is the value of the first field, `$NF` is the value of the last field. `eid` is a user-defined awk variable. Undefined variables act like the empty string or the number 0 (depending on the context). This action kicks in when the last field **changes** from the previous record. – glenn jackman Sep 26 '14 at 12:41
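
A minimal sketch of the two patterns described in the comments above, runnable with any POSIX awk. The temporary directory and the shortened sample data are made up for this illustration and are not part of the original answer.

```shell
# Illustration only: a scratch directory with a tiny, made-up sample.
tmp=$(mktemp -d)
printf 'EV01 Event number one\nEV02 Event number two\n' > "$tmp/eventlist"
printf '2014-09-25 EV02\n2014-09-24 EV02\n2014-09-25 EV01\n' > "$tmp/eventlog"

result=$(awk '
    # True only while reading the first file: overall record number equals per-file record number.
    NR == FNR  { id = $1; $1 = ""; event[id] = $0; next }
    # Second file: fires on the first record of each new event ID (eid starts out empty).
    $NF != eid { eid = $NF; print "new group:" event[eid] }
' "$tmp/eventlist" "$tmp/eventlog")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because `$1 = ""` leaves the separator in place, each stored description starts with a space, which is also why the answer's output shows two spaces after the count.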

score 1 · Answer 2 · edited May 23 '17 at 11:57

An alternative, suggested by an answer to the question "Awk to calculate number of days between two dates", would be (assuming for simplicity that there is a tab between EV01 and Event number one in the eventlist file):

#!/bin/sh
# Needs a date that supports -f (GNU date); eventlist must be tab-separated and sorted by event ID.
tab=$(printf '\t')
cut -f2 -d" " eventlog >ev.tmp
cut -f1 -d" " eventlog | date -f - +%s | awk '{print int($0/86400)}' \
    | paste - ev.tmp \
    | awk '{
        if (lastDay[$2] == $1-1) consecCount[$2]++
        else consecCount[$2] = 1
        lastDay[$2] = $1
      }
      END {for (i in consecCount) print i "\t" consecCount[i]}' \
    | sort | join -t"$tab" - eventlist | cut -f2,3

The key step here is that date -f - +%s converts each date read from standard input into seconds since the epoch (the -f option as used here is a GNU date feature), so we can divide that number by the number of seconds in a day (86400) to find the number of days since the epoch. Finding the length of the most recent run of consecutive days for each event is then straightforward: the counter grows when a record falls exactly one day after the previous record for the same event and resets to 1 otherwise. We can then match the longer labels to each event count with a combination of join (using a tab as the field delimiter) and cut.
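
A small sketch of the conversion step on its own, assuming GNU date. Setting TZ=UTC is an addition for this illustration so that the result does not depend on the local timezone; the script above does not set it.

```shell
# Assumes GNU date: -f - reads one date per line from standard input.
# TZ=UTC is an assumption made here to keep the day numbers timezone-independent.
days=$(printf '2014-09-24\n2014-09-25\n' \
    | TZ=UTC date -f - +%s \
    | awk '{ print int($0 / 86400) }')

# Consecutive calendar days come out as consecutive integers.
printf '%s\n' "$days"
```

Two dates are on consecutive days exactly when their day numbers differ by 1, which is the comparison the awk step in the script relies on.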

This solution uses more tools than @glenn jackman's solution but avoids the need for mktime() and strftime(), which may not be available in all dialects of awk.

Simon, I've tried this and the 'join' part does not seem to work. I get an error thrown that says "join: multi-character tab ' '". I tried modifying the spacing next to 'join -t' and the script runs, but I don't see any output? — armohan, Sep 26 '14 at 10:52
@armohan: It is tricky to persuade `join` to use a tab as a delimiter but pressing the Tab key between double quotes worked for me. Alternatively, you can use the keystrokes or you can use other techniques shown in answers to the question [Unix join separator char](http://stackoverflow.com/questions/1722353/unix-join-separator-char). — Simon, Sep 27 '14 at 03:20
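
One way to sidestep typing a literal tab, sketched here under the assumption of a POSIX shell, is to let printf produce the character and pass it to join through a variable. The file names and sample counts below are made up for this illustration.

```shell
# Build a real tab character without typing one.
tab=$(printf '\t')

# Illustration only: scratch files standing in for the awk output and the eventlist.
tmp=$(mktemp -d)
printf 'EV01\t6\nEV02\t4\n' > "$tmp/counts"
printf 'EV01\tEvent number one\nEV02\tEvent number two\n' > "$tmp/eventlist"

# Both inputs are already sorted on the join field (the event ID).
joined=$(join -t "$tab" "$tmp/counts" "$tmp/eventlist" | cut -f2,3)

printf '%s\n' "$joined"
rm -rf "$tmp"
```

The "multi-character tab" error appears when the argument to -t is more than one character, for example when an editor or a copy-paste step has turned the tab into several spaces.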
