
I am very stumped. I am searching multiple files for groups of lines that look like the example below, by find-ing files with the desired start date and piping each one to grep, using this command:

find logdir/ -type f -regextype sed -regex ".*2016-06-22.*" | while read fname
do
  zgrep -a -P -B9 ".*COOKTHE.*slave.*" $fname
done

The output is groups of lines like this:

2017-05-10 12:14:54 DEBUG[dispatcher-1533] something.else.was.here.Pia - http://server:9999/cookout/123123123123/entry c7aab5a3-0dab-4ce1-b188-b5370007c53c request:
 HEADERS:
 Host: server:9999
 Accept: */*
 User-Agent: snakey-requests/2.12.3
 Accept-Encoding: gzip, deflate
 Connection: keep-alive
 Timeout-Access: <function1>
 CONTENT:
  {"operation": "COOKTHE", "reason": "sucker verified", "username": "slave"}

From the first matched line, I'm trying to extract the full date-time string (2017-05-10 12:14:54) and the digit pattern 123123123123; from the last line, I want the entire matched line ({"operation": "COOKTHE", "reason": "sucker verified", "username": "slave"}).

How can I extract these with grep, sed, or awk?

– Unpossible

3 Answers


First, let's simplify your initial query. I don't think you need a regex there; globbing is simpler, faster, and more legible. Similarly, you don't need grep's -P option because you're not using any Perl-specific regex syntax, and PCRE matching slows things down as well.

find logdir/ -type f -name '*2016-06-22*' | while read fname
do
  zgrep -a -B9 '"COOKTHE".*"slave"' "$fname"
done | grep -e ^20 -e '{'

That recreates your original logic but should run a bit faster. It also adds a filter to show just the two lines you've asked for. However, I worry that -B9 isn't a good solution since there may be a variable number of headers to track. The final filter is also somewhat rudimentary just to be quick.
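If that quick final filter ever proves too loose (say, a header value that happens to begin with 20), a stricter sketch of the same idea, assuming your two target lines always start with a full date or with optional whitespace before the opening brace, would be:

done | grep -E -e '^[0-9]{4}-[0-9]{2}-[0-9]{2} ' -e '^[[:space:]]*\{'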

Here's a more complete solution:

find logdir/ -type f -name '*2016-06-22*' | while read fname
do
  zcat "$fname" | awk '
    /^20/ && $6 ~ /^http/ {
      split($6, url, "/")           # split the URL by slashes
      stamp = $1 " " $2 " " url[5]  # "2017-05-10 12:14:54 123123123123"
    }
    /{.*"COOKTHE".*"slave"/ { print stamp; print }
  '
done

This saves the date, time, and the 5th fragment of the URL in the stamp variable and prints it only when you've got a match in the JSON line. I modified your regex to include a { to indicate the start of the JSON as well as quotes to improve your match, but you can change it to whatever you like. You don't need a leading or trailing .* on this regex.

AWK concatenates adjacent items, so $1 " " $2 " " url[5] merely represents the value of the first column, a space, the second column, another space, then the URL's 5th item (noting the empty item following "http:").
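To see why url[5] is the right index, here's a quick standalone check using the URL from the question (a throwaway demo, not part of the solution):

echo 'http://server:9999/cookout/123123123123/entry' |
  awk '{ n = split($0, url, "/"); for (i = 1; i <= n; i++) print i, url[i] }'

This prints http: as item 1, an empty item 2 (from the double slash), server:9999 as item 3, cookout as item 4, and 123123123123 as item 5.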

This won't tell you which file the matching text came from (compare to grep -H). To do that, you want:

  zcat "$fname" | awk -v fname="$fname:" '
    # … (see above)
    /{.*"COOKTHE".*"slave"/ { print fname stamp; print fname $0 }
  '

If the JSON strings you're looking for are consistently placed and spaced, you could instead make that final clause $2 ~ /"COOKTHE"/ && $NF ~ /"slave"/ which would improve awk's speed (actually, its ability to fail faster) on longer lines.
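Put together, that variant looks like the sketch below. It assumes the JSON always keeps "operation" as its first key and "username" as its last, as in your sample; adjust the field numbers if your logs differ:

  zcat "$fname" | awk '
    /^20/ && $6 ~ /^http/ {
      split($6, url, "/")
      stamp = $1 " " $2 " " url[5]   # date, time, digit pattern from the URL
    }
    $2 ~ /"COOKTHE"/ && $NF ~ /"slave"/ { print stamp; print }
  '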

– Adam Katz

An awk solution for your current input:

awk 'NR==1{ sub(/http:\/\/[^\/]+\/[^\/]+\//,"",$6); 
     print $1,$2,substr($6,1,index($6,"/")-1)}END{ print $0 }' input

The output:

2017-05-10 12:14:54 123123123123
  {"operation": "COOKTHE", "reason": "sucker verified", "username": "slave"}
– RomanPerekhrest
  • Yeah, mistook that as well. I think OP wants everything from the first line match to the last line match (and the lines in between). Got that from the `-B9` in the Q. – Alfe Jul 19 '17 at 09:58
  • I'm actually looking for matches in the first line and to grab the entire last line as well. Sorry for the misconception. – Unpossible Jul 19 '17 at 11:49
  • Also can I just pipe the results of the find to this awk command? I tried to and the result is on two lines – Unpossible Jul 19 '17 at 11:51
  • I kind of kludged together this monster from your awk: `find /opt/vardar/logs/ -type f -regextype sed -regex ".*2016-06-22.*" | while read fname; do zgrep -a -P -B9 ".COOKTHE.*slave.*" $fname | grep -vE 'COOKTHE|^--$'| grep DEBUG | awk -F '[ ]|/' '{print $1,$2,$10}'; done` and it seems to be working. – Unpossible Jul 19 '17 at 13:01
… | while read fname
do
  zcat "$fname" | tr '\n' '\f' |
    grep -o -P '\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d.*?COOKTHE[^}]*\}' |
      tr '\f' '\n'
done

If your input already contains formfeed characters (\f), use any other character that does not appear in the input instead.
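Combining this with the fix suggested in the comments below (re-splitting the stream after each closing brace, so grep's PCRE engine never has to backtrack across the whole file), a sketch of the full pipeline looks like this; it assumes GNU sed (for the \f and \n escapes) and, as the comment notes, that no other } in the input ends a line:

… | while read fname
do
  zcat "$fname" | tr '\n' '\f' |
    sed 's/\}\f/}\n/g' |
      grep -o -P '\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d.*?COOKTHE[^}]*\}' |
        tr '\f' '\n'
done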

– Alfe
  • When I run this I get a lot of `grep: exceeded PCRE's backtracking limit` without anything else produced. – Unpossible Jul 19 '17 at 13:02
  • If you have no `}` elsewhere, you might be able to solve the issue with inserting `| sed 's/\}\f/}\n/g'` after the first `tr` call. This will split the long one-line input into chunks separated after the closing brace and probably avoid the overload of the `grep` process. – Alfe Jul 19 '17 at 13:11
  • Thanks a lot! Very interesting method. – Unpossible Jul 19 '17 at 13:38
  • That's euphemistic. It's a hack, don't lie to yourself ;-) You can tell by the strange limits we encountered (backtracking_limit) or know of (no further closing braces in input allowed). But if it fits the need at hand, it's pragmatic enough to be used anyway. – Alfe Jul 19 '17 at 13:54
  • Hey all the best stuff are hacks. I'm sitting here wondering how to scale up to your hack-fu belt. – Unpossible Jul 19 '17 at 13:56
  • Btw, have a look at this A: https://stackoverflow.com/a/38972737/1281485 Seems that using `awk` provides much nicer solutions. – Alfe Jul 19 '17 at 14:02