How do I use regex in grep to match multiple lines and only get the last matched set?

Question

I have a file with some statistics like this

2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-01 01:00:00 COMPONENT | USAGE (%)
2023-01-01 01:00:00 class.zzz.aaa.bbb | 32
2023-01-01 01:00:00 class.fff.aaa.ggg | 20
2023-01-01 01:00:00 TOTAL: 52% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-02 01:00:00 COMPONENT | USAGE (%)
2023-01-02 01:00:00 class.xxx.aaa.bbb | 42
2023-01-02 01:00:00 class.bbb.aaa.zzz | 10
2023-01-02 01:00:00 class.zzz.xxx | 21
2023-01-02 01:00:00 class.xxx.sss.ggg | 5
2023-01-02 01:00:00 TOTAL: 78% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

and I would like to cut out the last set of statistics (in the example above it would be the last 6 lines). As you can see, the number of lines in each section can change, but the first and the last line stay constant. I was thinking about using:

"TOTAL" as an anchor point to grab the first and the last line of the wanted block of text
(?s) mode to match all lines in between those two

I ended up with the regex (?m)^.*?TOTAL(?s).*?(?m)TOTAL.*?$. To use it on Linux, I ran it through grep with the -P (Perl-compatible regex) option, since I haven't had much luck with -E:

tac con.log | grep -Po "(?m)^.*?TOTAL(?s).*?(?m)TOTAL.*?\$" -m1 | tac

which resulted in this correct output

2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

as expected. However, this was in my testing environment, which uses an old grep (version 2.5.3). When I tried it on my other machine running Rocky Linux 9, which uses grep version 3.6, I got no match at all. Since this regex also worked when tested at regex101.com, I believe this might be a nuance of newer grep versions. Is there anything special that newer versions of grep require for a regex like this to work, or is there any other way to get this result (ultimately, it will be used in a bash script)?

Does it have to be `grep`? Seems like it would be easier in `perl` or `awk`. — Barmar, May 25 '23 at 15:29
On your Rocky Linux 9 machine, are you running it inside a script or directly on the console? Have you checked for quoting errors caused by environment differences? Perhaps use `'` for the regex? — kevinnls, May 25 '23 at 15:40
Please change `cut out` in `I would like to cut out the last set of statistics` to either `print` or `remove`, whichever it is you mean (which I now think is `print` but originally thought was `remove`). — Ed Morton, May 26 '23 at 11:07

zdim · Answer 1 · 2023-06-11T02:46:41.303

With Perl,^† one way:

perl -0777 -wnE'$r = $1 while /(^[0-9\s:-]+TOTAL.+? TOTAL.+?$)/smxg; say $r' file

or

perl -0777 -wnE'say for /.*( ^[0-9\s:-]+ TOTAL.+? TOTAL.+?$ )/smxg' file

The first command captures and assigns every such record in turn, and the second matches the whole file up to the last record, so either way the file has to be read through once; the approach from the question makes three passes over the file. We can process backwards if performance is an issue, like here for example. See the performance effect here.

Altogether I'd recommend a short script instead.

Not sure why grep does what you show; I'd imagine that the above regex should work, even slightly simplified using grep's conventions.

^† In the question as originally posted by the OP there was a perl tag.

jhnc · Answer 2 · 2023-05-26T13:58:24.400

With GNU grep:

grep -zPo '(?s).*\n\K.*TOTAL .*?TOTAL:.*?\n' con.log

This works with 3.7. Seems to mostly work with version 2.20 (appends an extraneous newline). It is likely to be inefficient with huge input files.

I suspect the reason your regex that works at regex101 is failing when used with grep is that grep applies the regex to each line of input in turn. So a regex that tries to match multiple lines at once is always going to fail.
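
A minimal sketch of that difference, assuming GNU grep built with PCRE support (-P). The two-line input and the variable names are made up for illustration. Without -z each line is searched on its own, so a pattern containing \n has nothing to match against; with -z the input is split on NUL bytes, so the whole text is one record.

```shell
# Hypothetical two-line input, not taken from the real log.
input='a TOTAL start
b TOTAL: end'

# Default mode: lines are searched one at a time, so "\n" never matches.
# grep -c exits non-zero when nothing matches, hence the "|| true".
per_line=$(printf '%s\n' "$input" | grep -cP 'start\nb' || true)

# -z mode: records end at NUL bytes, so the whole input is one record.
whole_input=$(printf '%s\n' "$input" | grep -czP 'start\nb' || true)

echo "line mode: $per_line match(es); -z mode: $whole_input match(es)"
```

The same pattern is used in both calls; only the record separator changes.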

With tac and awk, to avoid reading the entire file:

tac con.log | awk 's+=($3~/^TOTAL:?$/); s>1{exit}' | tac

s starts as zero/false. Each time a start or finish line is found, it is incremented. When non-zero, the line is printed (default action). When both start and finish lines have matched (s==2), we abort.

Assumes only well-formed records in the log. Allows for unrelated data interspersed between statistics records.

If the file could end with a partial record (that should be ignored), there is:

tac con.log | awk '
    !s && $3=="TOTAL:" { s=1 }
    s;
    s && $3=="TOTAL" { exit }
' | tac
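
To see the effect of a partial final record, here is a small sketch. It wraps the command above unchanged in a function; the name last_complete_record and the cut-down sample log are hypothetical and exist only for this illustration.

```shell
# last_complete_record FILE: wraps the tac/awk pipeline shown above.
last_complete_record() {
    tac "$1" | awk '
        !s && $3=="TOTAL:" { s=1 }   # first footer seen from the end: start printing
        s;                           # print while inside the record
        s && $3=="TOTAL" { exit }    # header line reached: stop reading
    ' | tac
}

# Hypothetical log whose last record was cut off before its "TOTAL:" line.
log=$(mktemp)
cat > "$log" <<'EOF'
2023-01-02 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-02 01:00:00 class.xxx.aaa.bbb | 42
2023-01-02 01:00:00 TOTAL: 42% out of 100% allocated memory consumed
2023-01-03 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
EOF

last_complete_record "$log"   # prints only the three lines of the 2023-01-02 record
rm -f "$log"
```

The unfinished 2023-01-03 record is skipped because printing only starts once a TOTAL: line has been seen.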

If the log file contains no unrelated data (just a list of complete statistics records), then only the termination condition needs to be tested:

tac con.log | awk '1; $3=="TOTAL"{exit}' | tac

Assuming that the number of lines of output will never exceed some threshold (here 1000), there is also a straightforward tac and (GNU) grep solution that works whether or not there is a partial final record:

tac con.log |
grep -A1000 -m1 'TOTAL:' |
grep -B1000 -m1 'TOTAL ' |
tac

score 2 · Answer 3 · answered May 26 '23 at 04:50

Or just do it the ultra lazy way:

echo '
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-01 01:00:00 COMPONENT | USAGE (%)
2023-01-01 01:00:00 class.zzz.aaa.bbb | 32
2023-01-01 01:00:00 class.fff.aaa.ggg | 20
2023-01-01 01:00:00 TOTAL: 52% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-02 01:00:00 COMPONENT | USAGE (%)
2023-01-02 01:00:00 class.xxx.aaa.bbb | 42
2023-01-02 01:00:00 class.bbb.aaa.zzz | 10
2023-01-02 01:00:00 class.zzz.xxx | 21
2023-01-02 01:00:00 class.xxx.sss.ggg | 5
2023-01-02 01:00:00 TOTAL: 78% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed' |

mawk 'BEGIN { RS = ORS = " consumed\n" } END { print }'

or even

gawk 'BEGIN { RS=(ORS=FS=" consumed\n")"$" } $0=$NF'

2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

pmqs · Answer 4 · 2023-05-25T16:56:37.680

The key observation with your data is that you want the data from the last occurrence of TOTAL MEMORY ALLOCATION CONSUMPTION in your input dataset. You can use greedy matching to achieve that:

use strict;
use warnings;


my $data = <<EOM;
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-01 01:00:00 COMPONENT | USAGE (%)
2023-01-01 01:00:00 class.zzz.aaa.bbb | 32
2023-01-01 01:00:00 class.fff.aaa.ggg | 20
2023-01-01 01:00:00 TOTAL: 52% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-02 01:00:00 COMPONENT | USAGE (%)
2023-01-02 01:00:00 class.xxx.aaa.bbb | 42
2023-01-02 01:00:00 class.bbb.aaa.zzz | 10
2023-01-02 01:00:00 class.zzz.xxx | 21
2023-01-02 01:00:00 class.xxx.sss.ggg | 5
2023-01-02 01:00:00 TOTAL: 78% out of 100% allocated memory consumed
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed
EOM

$data =~ s/.+           # Do a greedy match
           (?=          # positive lookahead (matches without consuming)
              ^         #     Start of a line
              .+?       #     non-greedy match
              TOTAL\sMEMORY\sALLOCATION\sCONSUMPTION # literal string
           )            # end of lookahead
          //smx; # s: . also matches newline; m: ^ matches at each line start; x: allow whitespace and comments

print $data;

Running that gives

$ perl try.pl 
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

That can all be condensed into a one-liner (-0777 slurps the whole input and -p prints the result):

perl -0777 -pe 's/.+(?=^.+?TOTAL\sMEMORY ALLOCATION CONSUMPTION)//sm' file

tripleee · Answer 5 · 2023-05-29T04:58:19.407

For completeness, a simple Awk script.

awk '/TOTAL MEMORY/ { p=$0; next }
  p { p = p ORS $0 }
  /TOTAL:/ { result=p; p="" }
  END { print result }' file

This implements a simple state machine where we collect all the lines in the current entry into a string and then at the end print out the (last) collected string.

In some more detail, recall that Awk runs the script on each incoming line (or, more broadly, input record), one at a time. When we see the first regex, we start collecting items into p, and skip the rest of the script for this line. On subsequent lines, as long as p is nonempty, we add lines to it, separated by ORS, the output record separator (defaults to newline), and then when we reach an input line which matches TOTAL: we stop collecting, and copy the currently collected p into result. Finally, the END block runs after we reach the end of the input stream, and we print whatever string we last collected into result.
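
Since the question mentions that the result will end up in a bash script, the same state machine can also be sketched as a plain shell read loop without Awk. This is only an illustration, assuming that every record starts at a line containing TOTAL MEMORY and ends at a line containing TOTAL:. The function name last_record is made up, and for large logs the Awk version is likely to be much faster.

```shell
# last_record FILE: print the last complete record found in FILE.
# The body runs in a subshell "( ... )" so its variables stay private.
last_record() (
    nl='
'
    current='' result='' collecting=0
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *'TOTAL MEMORY'*)          # header: start a new collection
                current=$line
                collecting=1
                ;;
            *)
                [ "$collecting" -eq 1 ] || continue
                current=$current$nl$line
                case $line in
                    *'TOTAL:'*)        # footer: the record is complete
                        result=$current
                        collecting=0
                        ;;
                esac
                ;;
        esac
    done < "$1"
    if [ -n "$result" ]; then
        printf '%s\n' "$result"
    fi
)
```

It would be called as last_record con.log. As in the Awk script, a record that never reaches its TOTAL: line is not copied into the result.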

In addition to being portable way back to the original AT&T Unix, this is also easy to understand and modify; the regular expressions are trivial, and the overall logic is reasonably simple and obvious.

score 0 · Answer 6 · answered May 26 '23 at 11:03

Using any awk in any shell on every Unix box:

$ awk '/TOTAL /{rec=$0; next} {rec=rec ORS $0} END{print rec}' file
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed

score 0 · Answer 7 · answered May 26 '23 at 17:34

Grep with tail will do the job, provided the last record is known to be exactly 6 lines long:

$ grep TOTAL: file -B6 | tail -n6
2023-01-01 01:00:00 TOTAL MEMORY ALLOCATION CONSUMPTION:
2023-01-03 01:00:00 COMPONENT | USAGE (%)
2023-01-03 01:00:00 class.xxx.yyy.zzz | 10
2023-01-03 01:00:00 class.xxx.zzz.aaa | 20
2023-01-03 01:00:00 class.zzz.aaa.bbb | 30
2023-01-03 01:00:00 TOTAL: 60% out of 100% allocated memory consumed
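
The fixed count of 6 fits the sample output only, while the question says the number of lines per record varies. A hedged sketch that locates the last header line instead of counting lines follows. The function name print_last_block is made up, and it assumes the header always contains TOTAL MEMORY and the footer always contains TOTAL:.

```shell
# print_last_block FILE: print from the last "TOTAL MEMORY" line
# up to and including the next "TOTAL:" line.
print_last_block() {
    # grep -n prefixes each match with "LINENO:"; keep the last match only.
    start=$(grep -n 'TOTAL MEMORY' "$1" | tail -n 1 | cut -d: -f1)
    [ -n "$start" ] || return 1          # no header line found at all
    # Print from that line onwards; sed quits after the first footer line.
    tail -n +"$start" "$1" | sed '/TOTAL:/q'
}
```

Called as print_last_block file, this reads the file twice, and if the final record has no TOTAL: line yet, the incomplete record is printed as it stands.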
