
I have a script that reads log files and parses the data to insert it into a MySQL table.

My script looks like:

while read x;do
var=$(echo ${x}|cut -d+ -f1) 
var2=$(echo ${x}|cut -d_ -f3)
...
echo "$var,$var2,.." >> mysql.infile 
done<logfile

The problem is that the log files are thousands of lines long, so this takes hours...

I read that awk is better; I tried it, but I don't know the syntax to parse the variables...

EDIT: the inputs are structured firewall logs, so they are pretty large files, like

@timestamp $HOST reason="idle Timeout" source-address="x.x.x.x" source-port="19219" destination-address="x.x.x.x" destination-port="53" service-name="dns-udp" application="DNS"....

So I'm using a lot of grep for ~60 variables, e.g.

sourceaddress=$(echo ${x}|grep -P -o '.{0,0} source-address=\".{0,50}'|cut -d\" -f2)

If you think Perl will be better, I'm open to suggestions, and maybe a hint on how to script it...

beliz
vessel
    I don't think `awk` will give you any significant improvement in time.. – sjsam Nov 13 '17 at 08:25
  • Use another language. I have replaced bash scripts with Perl a couple of times for long tasks and the difference was **enormous**. Shell is slow. – Nic3500 Nov 13 '17 at 08:28
  • 1
    @sjsam why not? see https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice – Sundeep Nov 13 '17 at 08:31
  • @vessel it would help if you add a sample input (say 3-5 lines) and show the expected output you need to append to another file... no need to replicate your full requirement, restrict it to say 3 variables – Sundeep Nov 13 '17 at 08:33
  • @Sundeep: Please note that I have used `significant` in my comment. For larger files `perl` is suggested. Also, the link you pointed out doesn't actually make a comparison between tools. It just discusses ups and downs of a practice. – sjsam Nov 13 '17 at 08:46
  • apart from looping time, the various variables look like just field extraction with appropriate FS declared.. I'd say significant improvement can be expected from the info given in question... – Sundeep Nov 13 '17 at 08:59

3 Answers


To answer your question, I assume the following rules of the game:

  • each line contains various variables
  • each variable can be found by a different delimiter.

This gives you the following awk script:

awk 'BEGIN{OFS=","}
     { FS="+"; $0=$0; var1=$1;
       FS="_"; $0=$0; var2=$3;
               ...
       print var1,var2,... >> "mysql.infile"
     }' logfile

It basically does the following:

  • set the output separator to ,
  • read line
  • set the field separator to +, re-parse the line ($0=$0) and determine the first variable
  • set the field separator to '_', re-parse the line ($0=$0) and determine the second variable
  • ... continue for all variables
  • print the line to the output file.
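A minimal runnable sketch of the re-split trick described above, using an invented one-line sample (the field names and input are for illustration only):

```shell
# Sample record with '+'-delimited and '_'-delimited parts.
echo 'alpha+beta_gamma_delta' |
awk 'BEGIN{OFS=","}
     { FS="+"; $0=$0; var1=$1;   # re-split on "+": var1 is "alpha"
       FS="_"; $0=$0; var2=$3;   # re-split on "_": var2 is "delta"
       print var1, var2          # prints "alpha,delta"
     }'
```

The key point is that changing `FS` alone does not re-split the current record; the assignment `$0=$0` forces awk to split the line again with the new separator.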
kvantour
  • This is great, I'm almost done. The only problem I'm facing is that I need to parse a variable coming from `geoiplookup ipaddress`. I tried `awk -v country="$country"` and `FS="\""; $0=$0; CIP=$4;`, but how do I get each line to run `country=$(geoiplookup CIP)`? I'm getting a syntax error – vessel Nov 14 '17 at 08:32
  • OK I found my answer at https://stackoverflow.com/questions/20646819/how-can-i-pass-variables-from-awk-to-a-shell-command – vessel Nov 14 '17 at 09:58
  • Glad to see you found a solution. If you use `getline` and you have a lot of the same `CIP` values, it might be useful to buffer the results to speedup to program. – kvantour Nov 14 '17 at 12:55
  • well there are lot of the same values, how should I buffer the results...? – vessel Nov 14 '17 at 20:32
  • It depends a bit on what you are doing, but you could have the following awk line `(buffer[CIP]==0) { cmd="geoiiplookup "CIP; cmd | getline buffer[CIP]; close(cmd) }`. This would buffer the result, i.e. store it in an array. If the value already exists, don't execute `geoiiplookup` anymore but just pick the result from `buffer[CIP]` – kvantour Nov 15 '17 at 13:29
  • I don't know how I should buffer geoip addresses, since to compare them I need to run `geoiplookup CIP`, but there are plenty of other fields that are pretty much the same... Can I set them with `buffer[field]==value` and speed up the parsing of the logs? – vessel Nov 16 '17 at 08:11
  • The above lines, buffer on `CIP`, i.e. if I already encountered `CIP`, do not run `geoiiplookup` but take the output from the buffer, otherwise, store the `geoiipllokup` output in the buffer for `CIP`. Nonetheless, I do think that this problem should go in a new post. – kvantour Nov 16 '17 at 10:12
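The caching pattern discussed in these comments can be sketched as follows. Since `geoiplookup` may not be installed, `tr . -` stands in here as a cheap placeholder for the external command; the input addresses are invented:

```shell
# Cache the output of an external command per input value, so the
# command runs only once for each distinct value ("tr . -" is a
# stand-in for geoiplookup).
printf '1.2.3.4\n5.6.7.8\n1.2.3.4\n' |
awk '{ ip=$1
       if (!(ip in cache)) {              # first time we see this value
           cmd = "echo " ip " | tr . -"
           cmd | getline cache[ip]        # store the command output
           close(cmd)                     # close so the command can be reused
       }
       print ip, cache[ip]
     }'
```

For the repeated `1.2.3.4` above, the cached result is reused instead of spawning the command again, which is where the speedup for logs with many duplicate addresses comes from.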

The Perl one-liner below might help:

perl -ane '/^[^+]*/;printf "%s,",$&;/^([^_]*_){2}([^_]*){1}_.*/;printf "%s\n",$+' logfile

Since `$&` can incur a performance penalty, you could also use the `/p` modifier, like below:

perl -ane  '/^[^+]*/p;printf "%s,",${^MATCH};/^([^_]*_){2}([^_]*){1}_.*/;printf "%s\n",$+' logfile

For more on Perl regex matching, refer to [ PerlDoc ]

sjsam

If you're extracting the values in order, something like this will help:

$ awk -F\" '{for(i=2;i<=NF;i+=2) print $i}' file 

idle Timeout
x.x.x.x
19219
x.x.x.x
53
dns-udp
DNS

You can easily change the output format as well:

$ awk -F\" -v OFS=, '{for(i=2;i<=NF;i+=2) 
                        printf "%s", $i ((i>NF-2)?ORS:OFS)}' file

idle Timeout,x.x.x.x,19219,x.x.x.x,53,dns-udp,DNS
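If the `key="value"` pairs may appear in varying order, or you only want a few of the ~60 fields, a variant of the same idea is to collect the pairs into an array keyed by name and print only what you need. A sketch, using a shortened sample line based on the question (the chosen keys are just examples):

```shell
# Split on double quotes: odd fields end in 'key=', even fields are values.
# Store each value under its key name, then print selected keys.
echo '@ts HOST reason="idle Timeout" source-address="1.2.3.4" destination-port="53"' |
awk -F'"' -v OFS=, '{
       for (i=1; i<NF; i+=2) {
           k=$i
           sub(/^.* /, "", k)    # drop everything up to the last space
           sub(/=$/, "", k)      # drop the trailing "=" -> bare key name
           f[k]=$(i+1)
       }
       print f["source-address"], f["destination-port"]   # prints "1.2.3.4,53"
     }'
```

This trades a little per-line work for order independence: fields are addressed by name rather than by quote-count position.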
karakfa