
Suppose I have a log file mylog like this:

[01/Oct/2015:16:12:56 +0200] error number 1
[01/Oct/2015:17:12:56 +0200] error number 2
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8
[01/Nov/2015:01:02:00 +0200] error number 9
[01/Jan/2016:01:02:00 +0200] error number 10

And I want to find the lines that occur between 1 Oct at 18:00 and 1 Nov at 01:00. That is, the expected output would be:

[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8

I have managed to convert the times to timestamps by using match() and then mktime(). The former finds the specified pattern and stores the captured groups in the array a[] so they can be accessed (see glenn jackman's answer to "access captured group from line pattern" for a good example). Since mktime() requires the format YYYY MM DD HH MM SS[ DST], I also have to convert the month from the Xxx form into a number, for which I use an answer by Ed Morton to "convert month from Aaa to xx": awk '{printf "%02d\n",(match("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'.
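In isolation, the month-conversion trick behaves like this: match() returns the 1-based character position of the abbreviation inside the month string, and (pos+2)/3 maps that position to the month number.

```shell
# "Oct" starts at position 28 in the string, and (28+2)/3 = 10
echo "Oct" | awk '{printf "%02d\n", (match("JanFebMarAprMayJunJulAugSepOctNovDec", $0)+2)/3}'
# prints 10
```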

All together, finally I have the timestamp in the variable mytimestamp:

awk 'match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {
        day=a[1]; month=a[2]; year=a[3];
        hour=a[4]; min=a[5]; sec=a[6]; utc=a[7];
        month=sprintf("%02d",(match("JanFebMarAprMayJunJulAugSepOctNovDec",month)+2)/3);
        mydate=sprintf("%s %s %s %s %s %s %s", year,month,day,hour,min,sec,utc);
        mytimestamp=mktime(mydate)
        print mytimestamp
    }' mylog

Returns:

1443708776
1443712376
1443715676

etc.

So now I am ready to compare against the given dates. Since handling such a format in awk takes some work, I prefer to provide the dates through external shell variables, using date -d"my date" +"%s" to print the timestamp:

start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")"
end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")"
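As a quick sanity check (assuming GNU date; the exact epoch values reflect the +0200 offset):

```shell
# epoch seconds for both boundaries of the range
start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")"
end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")"
echo "$start $end"
# prints 1443715200 1446332400
```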

All together, this works:

awk -v start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")" -v end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")" 'match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {day=a[1]; month=a[2]; year=a[3]; hour=a[4]; min=a[5]; sec=a[6]; utc=a[7]; month=sprintf("%02d",(match("JanFebMarAprMayJunJulAugSepOctNovDec",month)+2)/3); mydate=sprintf("%s %s %s %s %s %s %s", year,month,day,hour,min,sec,utc); mytimestamp=mktime(mydate); if (start<=mytimestamp && mytimestamp<=end) print}' mylog
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8

However, this seems like quite a bit of work for something that should be more straightforward. Nonetheless, the introduction to the "Time functions" section in man gawk says:

Since one of the primary uses of AWK programs is processing log files that contain time stamp information, gawk provides the following functions for obtaining time stamps and formatting them.

So I wonder: is there any better way to do this? For example, what if the format were dd Mmm YYYY HH:MM:ss instead of dd/Mmm/YYYY:HH:MM:ss? Couldn't the match pattern be provided externally, instead of having to change it every time the format changes? Do I really have to use match() and then process its output to feed mktime()? Doesn't gawk provide a simpler way to do this?

fedorqui
  • Hi there, I'm not familiar with awk or gawk; I came here because of the regex tag and found your question interesting. I am familiar with .bat programming though, and in such scenarios we use operating-system-defined variables for this kind of thing. Is it possible to mix environment variables with the parameters to awk? – Jorge Campos Dec 16 '15 at 12:02
  • @JorgeCampos thanks for the comment. Yes, in `awk` you can use environment variables. For example you can say `awk -v myvar="$shell_var" 'BEGIN{print myvar}'` to print a shell variable. See the usage of `-v` to pass it. – fedorqui Dec 16 '15 at 12:12
  • Wouldn't that be a solution for your problem? If, of course, there isn't a better way. – Jorge Campos Dec 16 '15 at 12:13
  • @JorgeCampos mmm yes, this is in fact one of my questions: can I provide such date format parameters externally to the `match()` function? – fedorqui Dec 16 '15 at 12:42
  • According to the docs, no, you can't. The only way I see is to use external variables with it. But as I said, I'm not an awk specialist. Maybe someone else knows a way! – Jorge Campos Dec 16 '15 at 12:54

3 Answers


Use ISO 8601 time format!

However, this seems to be quite a bit of work for something that should be more straight forward.

Yes, this should be straightforward, and the reason it is not is that the logs do not use ISO 8601. Application logs should use the ISO format and UTC to display times; anything else should be considered broken and fixed.

Your request should be split in two parts: the first canonises the logs, converting dates to the ISO format; the second performs the search:

awk '
match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {
  day = a[1]
  month = a[2]
  year = a[3]
  hour = a[4]
  min = a[5]
  sec = a[6]
  utc = a[7]
  month = sprintf("%02d", (match("JanFebMarAprMayJunJulAugSepOctNovDec", month)+2)/3)
  # zero-pad every field so that lexicographic order matches chronological order
  myisodate = sprintf("%04d-%s-%02dT%02d:%02d:%02d%s", year, month, day, hour, min, sec, utc)
  $1 = myisodate
  print
}' mylog

The nice thing about ISO 8601 dates (besides their being a standard) is that chronological order coincides with lexicographic order; therefore, you can use the /…/,/…/ range operator to extract the dates you are interested in. For instance, to find what happened between 1 Oct 2015 18:00 +0200 and 1 Nov 2015 01:00 +0200, append the following filter to the previous, standardising one:

awk '/2015-10-01T18:00:00\+0200/,/2015-11-01T01:00:00\+0200/'
Michaël Le Barbier
  • Could you please answer this question of mine http://stackoverflow.com/questions/39853960/parsing-lines-from-a-file-containing-date-time-greater-than-something. I have an open bounty worth 100 on that :) – Sandeepan Nath Oct 10 '16 at 09:43
  • The date format in my log files is a bit different. I tried to start with the date format given in this question, by creating a log file, with content same as the one given in the question and tried running your `awk` command like this - `awk `, but I do not get any output. – Sandeepan Nath Oct 10 '16 at 09:44

Without getting into time formats (assuming all records are formatted the same), you can use a sort | awk combination to achieve the same with ease.

This assumes the logs are not ordered; given your format, a special sort option handles the months (M) and awk picks the range of interest. The sort keys are year, month, and day, in that order.

$ sort -k1.9,1.12 -k1.5,1.7M -k1.2,1.3 log | awk '/01\/Oct\/2015/,/01\/Nov\/2015/'

You can easily extend this to include the time as well, and drop the sort if the file is already sorted.

The following applies the time constraint as well:

awk -F: '/01\/Oct\/2015/ && $2>=18 {p=1}
         /01\/Nov\/2015/ && $2>=1  {p=0} p' mylog
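On the sample log from the question, this flag technique behaves as expected; note that the p flag is cleared before the final bare p pattern is evaluated, which is why the 01:02 boundary line is excluded while the 00:10 line is kept. A self-contained sketch:

```shell
# with -F: the second field of each line is the hour
awk -F: '/01\/Oct\/2015/ && $2>=18 {p=1}
         /01\/Nov\/2015/ && $2>=1  {p=0}
         p' <<'EOF'
[01/Oct/2015:16:12:56 +0200] error number 1
[01/Oct/2015:17:12:56 +0200] error number 2
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8
[01/Nov/2015:01:02:00 +0200] error number 9
[01/Jan/2016:01:02:00 +0200] error number 10
EOF
# prints error numbers 3 through 8
```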
karakfa
  • Note this is even less generic than what I used in my question, and also very case-specific. I mean, it works and I am grateful for your effort, but it does not help to generalise the problem and provide a good tool to filter logs with a given format between two given date-times. – fedorqui Dec 16 '15 at 22:30
  • Why is there a need to use two different time formats? If you can use the same format in the logs, the script will be trivial. – karakfa Dec 16 '15 at 22:50

I would use the date command inside awk to achieve this, though I have no idea how this would perform on large log files, since it spawns one date process per line.

awk -F "[][]" -v start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")" \
    -v end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")" '{
        gsub(/\//,"-",$2); sub(/:/," ",$2);
        cmd = "date -d\"" $2 "\" +%s";
        cmd | getline mytimestamp;
        close(cmd);
        if (start<=mytimestamp && mytimestamp<=end) print
}' mylog
jijinp