
I have a script that reads log files and parses the data to insert it into a MySQL table.

My script looks like:

while read x;do
var=$(echo ${x}|cut -d+ -f1) 
var2=$(echo ${x}|cut -d_ -f3)
...
echo "$var,$var2,.." >> mysql.infile 
done<logfile

The problem is that the log files are thousands of lines long, so this takes hours...

I read that awk is better; I tried it, but I don't know the syntax to parse the variables...

EDIT: the inputs are structured firewall logs, so they are pretty large files, like

@timestamp $HOST reason="idle Timeout" source-address="x.x.x.x" source-port="19219" destination-address="x.x.x.x" destination-port="53" service-name="dns-udp" application="DNS"....

So I'm using a lot of grep for ~60 variables, e.g.

sourceaddress=$(echo ${x}|grep -P -o '.{0,0} source-address=\".{0,50}'|cut -d\" -f2)

If you think Perl will be better, I'm open to suggestions, and maybe a hint on how to script it...

beliz
vessel
    I don't think `awk` will give you any significant improvement in time.. – sjsam Nov 13 '17 at 08:25
  • Use another language. I have replaced bash scripts with Perl a couple of times for long tasks and the difference was **enormous**. Shell is slow. – Nic3500 Nov 13 '17 at 08:28
  • 1
    @sjsam why not? see https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice – Sundeep Nov 13 '17 at 08:31
  • @vessel it would help if you add a sample input (say 3-5 lines) and show the expected output you need to append to another file... no need to replicate your full requirement, restrict it to say 3 variables – Sundeep Nov 13 '17 at 08:33
  • @Sundeep: Please note that I have used `significant` in my comment. For larger files `perl` is suggested. Also, the link you pointed out doesn't actually make a comparison between tools. It just discusses ups and downs of a practice. – sjsam Nov 13 '17 at 08:46
  • apart from looping time, the various variables look like just field extraction with appropriate FS declared.. I'd say significant improvement can be expected from the info given in question... – Sundeep Nov 13 '17 at 08:59

3 Answers


To answer your question, I assume the following rules of the game:

  • each line contains various variables
  • each variable can be found by a different delimiter.

This gives you the following awk script:

awk 'BEGIN{OFS=","}
     { FS="+"; $0=$0; var1=$1;
       FS="_"; $0=$0; var2=$3;
               ...
       print var1,var2,... >> "mysql.infile"
     }' logfile

It basically does the following:

  • set the output separator to ,
  • read line
  • set the field separator to +, re-parse the line ($0=$0) and determine the first variable
  • set the field separator to '_', re-parse the line ($0=$0) and determine the second variable
  • ... continue for all variables
  • print the line to the output file.
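A minimal runnable sketch of the re-split trick described above, using an invented one-line sample (the field names and input are for illustration only):

```shell
# Sample record with '+'-delimited and '_'-delimited parts.
echo 'alpha+beta_gamma_delta' |
awk 'BEGIN{OFS=","}
     { FS="+"; $0=$0; var1=$1;   # re-split on "+": var1 is "alpha"
       FS="_"; $0=$0; var2=$3;   # re-split on "_": var2 is "delta"
       print var1, var2          # prints "alpha,delta"
     }'
```

The key point is that changing `FS` alone does not re-split the current record; the assignment `$0=$0` forces awk to split the line again with the new separator.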
kvantour
  • This is great, I'm almost done. The only problem I'm facing is that I need to parse a variable coming from `geoiplookup ipaddress`. I tried `awk -v country="$country"` and `FS="\""; $0=$0; CIP=$4;`, but how do I get each line to run `country=$(geoiplookup CIP)`? I'm getting a syntax error – vessel Nov 14 '17 at 08:32
  • OK I found my answer at https://stackoverflow.com/questions/20646819/how-can-i-pass-variables-from-awk-to-a-shell-command – vessel Nov 14 '17 at 09:58
  • Glad to see you found a solution. If you use `getline` and you have a lot of the same `CIP` values, it might be useful to buffer the results to speedup to program. – kvantour Nov 14 '17 at 12:55
  • well there are lot of the same values, how should I buffer the results...? – vessel Nov 14 '17 at 20:32
  • It depends a bit on what you are doing, but you could have the following awk line `(buffer[CIP]==0) { cmd="geoiiplookup "CIP; cmd | getline buffer[CIP]; close(cmd) }`. This would buffer the result, i.e. store it in an array. If the value already exists, don't execute `geoiiplookup` anymore but just pick the result from `buffer[CIP]` – kvantour Nov 15 '17 at 13:29
  • I don't know how I should buffer geoip addresses, since to compare them I need to run `geoiplookup CIP`, but there are plenty of other fields that are pretty much the same... Can I set them with `buffer[field]==value` and speed up the parsing of the logs? – vessel Nov 16 '17 at 08:11
  • The above lines, buffer on `CIP`, i.e. if I already encountered `CIP`, do not run `geoiiplookup` but take the output from the buffer, otherwise, store the `geoiipllokup` output in the buffer for `CIP`. Nonetheless, I do think that this problem should go in a new post. – kvantour Nov 16 '17 at 10:12
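The caching pattern discussed in these comments can be sketched as follows. Since `geoiplookup` may not be installed, `tr . -` stands in here as a cheap placeholder for the external command; the input addresses are invented:

```shell
# Cache the output of an external command per input value, so the
# command runs only once for each distinct value ("tr . -" is a
# stand-in for geoiplookup).
printf '1.2.3.4\n5.6.7.8\n1.2.3.4\n' |
awk '{ ip=$1
       if (!(ip in cache)) {              # first time we see this value
           cmd = "echo " ip " | tr . -"
           cmd | getline cache[ip]        # store the command output
           close(cmd)                     # close so the command can be reused
       }
       print ip, cache[ip]
     }'
```

For the repeated `1.2.3.4` above, the cached result is reused instead of spawning the command again, which is where the speedup for logs with many duplicate addresses comes from.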

The Perl one-liner below might help:

perl -ane '/^[^+]*/;printf "%s,",$&;/^([^_]*_){2}([^_]*){1}_.*/;printf "%s\n",$+' logfile

Since `$&` can incur a performance penalty, you could also use the `/p` modifier, like below:

perl -ane  '/^[^+]*/p;printf "%s,",${^MATCH};/^([^_]*_){2}([^_]*){1}_.*/;printf "%s\n",$+' logfile

For more on Perl regex matching, refer to [ PerlDoc ]

sjsam

If you're extracting the values in order, something like this will help:

$ awk -F\" '{for(i=2;i<=NF;i+=2) print $i}' file 

idle Timeout
x.x.x.x
19219
x.x.x.x
53
dns-udp
DNS

You can easily change the output format as well:

$ awk -F\" -v OFS=, '{for(i=2;i<=NF;i+=2) 
                        printf "%s", $i ((i>NF-2)?ORS:OFS)}' file

idle Timeout,x.x.x.x,19219,x.x.x.x,53,dns-udp,DNS
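If the `key="value"` pairs may appear in varying order, or you only want a few of the ~60 fields, a variant of the same idea is to collect the pairs into an array keyed by name and print only what you need. A sketch, using a shortened sample line based on the question (the chosen keys are just examples):

```shell
# Split on double quotes: odd fields end in 'key=', even fields are values.
# Store each value under its key name, then print selected keys.
echo '@ts HOST reason="idle Timeout" source-address="1.2.3.4" destination-port="53"' |
awk -F'"' -v OFS=, '{
       for (i=1; i<NF; i+=2) {
           k=$i
           sub(/^.* /, "", k)    # drop everything up to the last space
           sub(/=$/, "", k)      # drop the trailing "=" -> bare key name
           f[k]=$(i+1)
       }
       print f["source-address"], f["destination-port"]   # prints "1.2.3.4,53"
     }'
```

This trades a little per-line work for order independence: fields are addressed by name rather than by quote-count position.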
karakfa