
I have a log file that I need help parsing.

This is what it looks like:

2018-02-19 15:55:50.070 t.a.ApiUploader [INFO] zzz(708473232) uploaded file 'hdfs://fr-de.int.fz.net:4010/user/profile_export/aId=6/empId=4/classId=10/members-x--491eedd6-2e14-488f-8c13-84be2c6f777b.txt.gz' in 4 chunk(s) - total ops: 31, failed ops: 0
2018-02-19 15:55:50.092 t.a.ApiUploader [INFO] zzz(617022301) uploaded file 'hdfs://fr-de.int.fz.net:4010/user/profile_export/aId=6/empId=4/classId=10/members-x-de10af80-4ac5-4b1a-9675-f7aa9da7ecb2.txt.gz' in 5 chunk(s) - total ops: 45, failed ops: 0
2018-02-19 15:55:50.204 t.a.ApiUploader [INFO] zzz(89993157) uploaded file 'hdfs://fr-de.int.fz.net:4010/user/profile_export/aId=6/empId=4/classId=10/members-x-2aa7808e-a209-4bf8-a744-818724cca054.txt.gz' in 4 chunk(s) - total ops: 32, failed ops: 0

Now what I am trying to do is put the results of my parsing into an Excel-readable CSV file like the following:

Expected Output:

Date,aId,classId,total ops,failed ops
2018-02-19 15:55:50.070,6,10,31,0
2018-02-19 15:55:50.092,6,10,45,0
2018-02-19 15:55:50.204,6,10,32,0

I can get each piece separately, but how can I combine them all into comma-separated format? Is there a bash sample to do this?

cat twr.log | awk -F" " '{print $8}' | awk -F"/" '{print $8, $10}'

This gave me:

aId=6 classId=10
aId=6 classId=10
aId=6 classId=10

For the date I did this:

cat twr.log | awk -F" " '{print "Date: " $1, $2}'

Date: 2018-02-19 15:55:50.070
Date: 2018-02-19 15:55:50.092
Date: 2018-02-19 15:55:50.204

Any help is appreciated.

Thanks

user175084

6 Answers

$ cat tst.awk
BEGIN { FS="[ /=,]"; OFS="," }
NR==1 { print "Date", "aId", "classId", "total ops", "failed ops" }
{ print $1" "$2, $14, $18, $26, $30 }

$ awk -f tst.awk file
Date,aId,classId,total ops,failed ops
2018-02-19 15:55:50.070,6,10,31,0
2018-02-19 15:55:50.092,6,10,45,0
2018-02-19 15:55:50.204,6,10,32,0
Ed Morton
  • Thanks for the help... THIS WORKS FOR SURE!! – user175084 Apr 19 '18 at 20:20
  • Is there a way, after printing the above result, to add up the failed and total ops and display that? – user175084 Apr 19 '18 at 20:23
  • Of course, that would be trivial (see the sketch after this thread) but it looks like you already accepted an answer so - good luck! – Ed Morton Apr 19 '18 at 21:06
  • With all due respect, but this is just a slightly modified copy of the answer given by @Cyrus. – Andriy Makukha Apr 20 '18 at 04:38
  • Agreed. I thought about deleting it after I spotted that but I disagree with his use of `echo` with hard-coded commas to produce the header line outside of the awk script even if there was an empty input file so I decided to leave mine posted. I see he's deleted his now anyway. – Ed Morton Apr 20 '18 at 14:13
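
For reference, a sketch of the "trivial" addition discussed in that thread, building on the script in this answer (the running sums and the END block are my illustration, not Ed's posted code):

$ cat tst.awk
BEGIN { FS="[ /=,]"; OFS="," }
NR==1 { print "Date", "aId", "classId", "total ops", "failed ops" }
{ print $1" "$2, $14, $18, $26, $30; tot+=$26; fail+=$30 }
END { print "sum total ops: " tot ", sum failed ops: " fail }

With the three sample lines this should end with `sum total ops: 108, sum failed ops: 0`.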

If your URL is in a fixed format:

$ awk -v OFS=, 'BEGIN{print "Date,aId,classId,total ops,failed ops"}
                     {split($8,a,"/"); 
                      sub(/.*=/,"",a[6]); 
                      sub(/.*=/,"",a[8]); 
                      print $1 FS $2,a[6],a[8],$15 $18}' file

Date,aId,classId,total ops,failed ops
2018-02-19 15:55:50.070,6,10,31,0
2018-02-19 15:55:50.092,6,10,45,0
2018-02-19 15:55:50.204,6,10,32,0

Otherwise, you have to pattern match within the elements of array a for the keywords you're interested in.
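
A rough sketch of that fallback (my illustration, not part of the original answer; it still assumes the URL is the whitespace-separated field $8):

$ awk -v OFS=, 'BEGIN{print "Date,aId,classId,total ops,failed ops"}
                     {aid=cid=""; n=split($8,a,"/")
                      for (i=1; i<=n; i++) {
                        if (a[i] ~ /^aId=/)     {aid=a[i]; sub(/.*=/,"",aid)}
                        if (a[i] ~ /^classId=/) {cid=a[i]; sub(/.*=/,"",cid)}
                      }
                      print $1 FS $2,aid,cid,$15 $18}' file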

Note the hacks `$1 FS $2` and `$15 $18`, which special-case the output delimiters: the first joins date and time with a space (the default FS) instead of OFS, and the second relies on the trailing comma already present in $15.

UPDATE

Add this to the main block:

sum15+=$15; sum18+=$18

and this as the last block in the script:

END {print "sum total ops:",sum15, "sum failed ops:",sum18}
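
Note that OFS is still a comma in that script, so with the three sample lines the END block should print:

sum total ops:,108,sum failed ops:,0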
karakfa
  • Is there a way to print the above result and also add up the failed and total ops, then display that? – user175084 Apr 19 '18 at 20:33
  • Yeah seems to be an epidemic right now.... just happened over at https://stackoverflow.com/q/49902786/1745001 too. – Ed Morton Apr 20 '18 at 14:17

You can use the match function for this kind of case:

awk 'BEGIN { OFS =","; print "Date,aId,classId,total ops,failed ops" } 
{ 
    match($8,/aId=([0-9]*)\/.*\/classId=([0-9]*)/,a)
    print $1 " " $2,a[1],a[2],$(NF-3) $NF 
}' YOURFILE

It is likely to be more robust when the format of the URL is not completely rigid.
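
Since the three-argument match() is a gawk extension (see the comments below), a rough POSIX-awk equivalent of this approach could use RSTART/RLENGTH with substr() instead; this is my sketch, where the offsets 4 and 8 are the lengths of "aId=" and "classId=":

awk 'BEGIN { OFS=","; print "Date,aId,classId,total ops,failed ops" }
{
    aid = cid = ""
    if (match($8, /aId=[0-9]+/))     aid = substr($8, RSTART+4, RLENGTH-4)
    if (match($8, /classId=[0-9]+/)) cid = substr($8, RSTART+8, RLENGTH-8)
    print $1 " " $2, aid, cid, $(NF-3) $NF
}' YOURFILE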

Andriy Makukha
  • You should mention that's gawk-only for the 3rd arg to match(). – Ed Morton Apr 19 '18 at 19:23
  • Yep, thanks. I use only `gawk`, so I don't know what other versions' equivalents would be. But I do believe this approach is more flexible than counting 30 fields by numbers. It works "gracefully" even in case of a missing URL. – Andriy Makukha Apr 19 '18 at 19:59
  • Is there a way to print the above result and also add up the failed and total ops, then display that? – user175084 Apr 19 '18 at 20:33

The parts of the regex in parens get captured as $1, $2, etc.:

perl -lne 'BEGIN{print "Date,aId,classId,total ops,failed ops"}
print "$1,$2,$3,$4,$5" if /(\S+ \S+).+?aId=(\d+).+?classId=(\d+).+?total ops:\s*(\d+).+?failed ops:\s*(\d+)/' inputFile

As a side note, you can pipe the output of this (or of some of the awk commands) like so:

| datamash -sHt , -g classId mean 'total ops' sum 'failed ops' | column -t -s,
GroupBy(classId)  mean(total ops)  sum(failed ops)
10                36               0

This pulls out the data you might otherwise be hunting for in Excel. Datamash is available from most package managers (apt, pacman, etc.).

zzxyz
  • Thanks for the help... Will be considering the datamash solution... is there something similar in bash? – user175084 Apr 19 '18 at 20:22
  • @user175084 built-in? I don’t think so. But of course datamash is accessible from most bash shells. Probably MINGW would be the exception (though see the bash sketch after this thread). – zzxyz Apr 19 '18 at 20:33
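
Following up on that comment: there is no bash built-in for this, but a rough bash-only aggregation over the CSV produced by the answers above might look like the following (out.csv is a hypothetical file name, and the mean is integer-truncated):

# hypothetical: out.csv holds the header plus the CSV rows shown in the question
{
  read -r _                          # skip the header line
  total=0 failed=0 n=0
  while IFS=, read -r _ _ _ ops fails; do
    (( total += ops, failed += fails, n++ ))
  done
  echo "mean total ops: $(( total / n )), sum failed ops: $failed"
} < out.csv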

The following answer illustrates the self-documenting power of creating/passing variables to keep track of fields:

awk -v OFS=',' -v date=1 -v time=2 -v url=8 -v url_aid=6 -v url_cid=8 -v total=15 -v failed=18 '
    NR == 1 { print "Date", "aId", "classId", "total ops", "failed ops" }
    {
        split($url, arr, /\//)
        aid = arr[url_aid]; sub(/[^=]+=/, "", aid)
        cid = arr[url_cid]; sub(/[^=]+=/, "", cid)
        sub(/,/, "", $total)
        print $date " " $time, aid, cid, $total, $failed
    }' twr.log

A couple of comments:

  • The header is separated out on purpose, in case OFS is changed in the future.
  • $total is sanitized prior to printing for the same reason the header is separated out, but also because, in the alternate version, such a subtle thing as a missing comma between $total and $failed is easy to gloss over when reading.
  • The above solution is easy to work on and reason about if it needs to be modified in the future. For instance...

If the link is not fixed, or if filtering is required, something like the following would be more appropriate:

awk -v OFS=',' -v date=1 -v time=2 -v url=8 -v total=15 -v failed=18 '
    NR == 1 { print "Date", "aId", "classId", "total ops", "failed ops" }
    NF == failed {
        aid = cid = ""
        n = split($url, arr, /\//)
        for (i = 4; i <= n; ++i) {
            if (arr[i] ~ /^aId=/)
                aid = arr[i]
            else if (arr[i] ~ /^classId=/)
                cid = arr[i]
            if (aid != "" && cid != "") {
                gsub(/[^0-9]+/, "", aid)
                gsub(/[^0-9]+/, "", cid)
                gsub(/[^0-9]+/, "", $total)
                print $date " " $time, aid, cid, $total, $failed
                next
            }
        }
    }' twr.log
Michael Back
  • Thanks for the help, this worked, and I will be using the non-fixed solution! – user175084 Apr 19 '18 at 20:20
  • Is there a way, after printing the above result, to add up the failed and total ops and display that? – user175084 Apr 19 '18 at 20:23
  • @user175084 Sure! **Awk** is very loosely typed, so you can just add anything to do a string to number conversion (and back again). First append another bit to the header `print` statement to describe your column, and then append `, $TOTAL + $FAILED` to the inner `print` statement; a concrete sketch follows this thread. One other thing... the exemplified simple filtering for `NF` and `aid`/`cid` may not be careful enough for what you want, and you may want to add some more criterion beyond `NF == FAILED`. – Michael Back Apr 19 '18 at 20:42
  • Don't use all upper case variable names in awk (or shell) scripts - leave that for builtin variables (and exported ones in shell). Also, you don't have to manually set the FS to a blank char as that's its default value. – Ed Morton Apr 19 '18 at 21:07
  • @Ed - I appreciate your input as you have been contributing on **awk** for much longer than I have, but I've found it useful in my scripts to differentiate passed params from internal variables. From your experience, what is a good convention to use for passed params? – Michael Back Apr 19 '18 at 21:20
  • Thanks Michael. I've never thought about doing that but maybe upper CamelCase for one (`Date` and `ClassId`) and lower underscore_separated for the other (`date` and `class_id`)? So you'd have `FileName` passed in, and `file_name` internally and neither would clash with the keyword `FILENAME`. You could even just use lower camelCase internally (`date`, `fileId`, `fileName`) but that's not such an obvious difference. – Ed Morton Apr 19 '18 at 21:32
  • Thanks @Ed -- I'll think about it. I will definitely consider what you have said, knowing that to many leveraging all caps for this may feel like "bad taste." The more I write **awk** though the more I don't care about differentiating **awk** internal variables from passed params... On the other hand, visually differentiating passed params in my **awk** programs from other variables is of huge value to me. – Michael Back Apr 19 '18 at 22:54
  • A scenario to consider - right now you have a variable named `DATE`. Let's say in the next release the gawk maintainers decide to create a variable that holds the current date. They would probably name that variable `DATE`. Now when you get an update of gawk your script will fail to work and you'll have no idea why if you haven't been paying close attention to the release notes. There's already a LOT of built-in variables, especially in gawk, that I suspect few people really think about clashing with (e.g. `ERRNO`), see https://www.gnu.org/software/gawk/manual/gawk.html#Built_002din-Variables. – Ed Morton Apr 19 '18 at 23:00
  • You convinced me @Ed... here’s your tolower(). – Michael Back Apr 21 '18 at 01:00
  • @Ed -- even though I did the tolower() for people's tastes... I'm not fully convinced about the builtin variable case... Point of fact that GNU has added several functions, all lowercase -- such as strtonum(), and bit operations (I am trying to run some of my scripts with mawk now & ran into these as issues -- sigh). It is impossible to know if GNU or any other flavor of Awk will add date() as a function or DATE as a variable in the future. So, I'll use lowercase for vars in this forum to keep peoples eyes from bleeding, but I will continue to ponder about what to do in my company scripts. – Michael Back Apr 30 '18 at 18:50
  • @Michael if you have a variable name that clashes with a function name you'll get an immediate syntax error telling you what the error is and where it occurred and so you can easily and immediately fix it. If you have a variable name that clashes with a builtin variable name then you will get no warning, your script will just quietly, cryptically produce bad output that might take you weeks to notice and days of head-scratching to debug. That's why a clash of variable names is so much worse. – Ed Morton Apr 30 '18 at 21:02
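
To make the change discussed in this thread concrete, here is a sketch of the first script above with the per-line sum appended as an extra column (the "ops sum" header name is my invention, not part of the original answer):

awk -v OFS=',' -v date=1 -v time=2 -v url=8 -v url_aid=6 -v url_cid=8 -v total=15 -v failed=18 '
    NR == 1 { print "Date", "aId", "classId", "total ops", "failed ops", "ops sum" }
    {
        split($url, arr, /\//)
        aid = arr[url_aid]; sub(/[^=]+=/, "", aid)
        cid = arr[url_cid]; sub(/[^=]+=/, "", cid)
        sub(/,/, "", $total)
        print $date " " $time, aid, cid, $total, $failed, $total + $failed
    }' twr.log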

No need to whip out the awks and perls for this small nugget. Here is a Bash-only solution:

# field 15 is "total ops" (with a trailing comma), field 18 is "failed ops"
while read -r date1 date2 _ _ _  _ _ url _ _  _ _ _ _ total  _ _ failed; do
  # in the URL, path component 6 is "aId=..." and component 8 is "classId=..."
  IFS=/ read -r _ _ _ _ _ aid _ classId _ <<< "$url"
  printf '%s,%s,%s,%s,%s\n' "$date1 $date2" "${aid#*=}" "${classId#*=}" "${total%,}" "$failed"
done < file.log
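
Against the three sample lines this should produce:

2018-02-19 15:55:50.070,6,10,31,0
2018-02-19 15:55:50.092,6,10,45,0
2018-02-19 15:55:50.204,6,10,32,0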

Feels like Bash was made to parse CSV, doesn't it? ;-)

Marc Coiffier
  • Not even a little bit :-). That will be orders of magnitude slower than an equivalent awk script and contains bugs (see [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice)) – Ed Morton Apr 19 '18 at 23:05
  • Perhaps that's the price of readability (no pun intended) :-) I know that the `read` builtin is not the fastest way to read text, but I like that it allows one to give legible names to the fields. I don't write Bash for performance or correctness (for that I have Haskell), I do it for the free cookies. – Marc Coiffier Apr 20 '18 at 01:34