I have the following input and I'd like to write a regular expression which would match every line except the first and last.

2019-03-13 00:33:44,846 [INFO] -:  foo
2019-03-13 00:33:45,096 [INFO] -:  Exception sending email
To:
[foo@bar.com, bar@bar.com]
CC:
[baz@bar.com]
Subject:
some subject
Body:
some

body
2019-03-13 00:33:45,190 [INFO] -:  bar

I thought the following should work, but it doesn't produce the output I expect:

pcregrep -M ".+Exception sending email[\S\s]+?(?=\d{4}(-\d\d){2})" ~/test.log

In plain English I would describe this as: look for a line with the exception text, followed by any character (including newlines) non-greedily, until we hit a positive lookahead for a date.

For some reason the pcregrep output also includes the final line (the one ending in bar), even though the match on regex101 does not. What am I missing here?


Normally I would just use grep -A for something like this, but the body can span an arbitrary number of lines.
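
To make the plain-English reading of the pattern concrete, here is a minimal sketch that runs it against the sample input with Python's `re` module. Python is only a stand-in for PCRE here (an assumption: the two engines agree on this particular pattern), and the names `LOG` and `PATTERN` are made up for the illustration. It shows what the regex itself matches, not what pcregrep prints.

```python
import re

# The sample input from the question, verbatim.
LOG = """\
2019-03-13 00:33:44,846 [INFO] -:  foo
2019-03-13 00:33:45,096 [INFO] -:  Exception sending email
To:
[foo@bar.com, bar@bar.com]
CC:
[baz@bar.com]
Subject:
some subject
Body:
some

body
2019-03-13 00:33:45,190 [INFO] -:  bar
"""

# Same pattern as the pcregrep command. "." does not cross newlines,
# [\S\s]+? lazily consumes anything until the lookahead sees a date.
PATTERN = re.compile(r".+Exception sending email[\S\s]+?(?=\d{4}(-\d\d){2})")

match = PATTERN.search(LOG)
matched_lines = match.group(0).splitlines()

print(matched_lines[0])        # the "Exception sending email" line
print(matched_lines[-1])       # body
print(repr(match.group(0)[-1]))  # '\n' (the match ends with a newline)
```

So the regex stops right before the final date, but the matched text ends with the newline that precedes it.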

Michael
  • Try: [\S\s]+ instead of [.\s]+ - the dot matches a literal dot when inside brackets. – Poul Bak Mar 13 '19 at 15:49
  • It works on regex101 because of the regex options `gm`. Also, what if a line in the body of your message starts with a date? – Jorge Campos Mar 13 '19 at 15:56
  • @JorgeCampos I have the multiline flag, the global flag doesn't conceptually apply to *grep though, right? Yeah, I had considered that. The content of the emails is such that in practice that won't happen. It's safe to ignore it. – Michael Mar 13 '19 at 15:58
  • I tested your regex against your input and it does return the expected result on my Debian setup (`pcregrep version 8.39 2016-06-14`). Are you using this exact input to test? – ttzn Mar 13 '19 at 16:15
  • @CoffeeNinja Yes, the input is the same as listed here. The version is a bit older, however I don't have the rights to change it: `pcregrep version 7.8 2008-09-05` – Michael Mar 13 '19 at 16:17
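
The point in the first comment about a dot inside brackets can be checked quickly. This is a small sketch in Python's `re` (used as a stand-in for PCRE, which treats character classes the same way in this respect); the variable names are made up for the illustration.

```python
import re

text = "a.b\nc"

# Inside a character class "." loses its special meaning,
# so [.\s] matches only literal dots and whitespace.
dot_or_space = re.findall(r"[.\s]", text)

# [\S\s] is "non-whitespace or whitespace", i.e. any character at all.
any_char = re.findall(r"[\S\s]", text)

print(dot_or_space)  # ['.', '\n']
print(any_char)      # ['a', '.', 'b', '\n', 'c']
```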

1 Answer

It almost certainly has to do with the tool rather than the regex: your pcregrep 7.8 predates a relevant change. As the changelog for pcregrep states under "Version 8.12 15-Jan-2011":

  1. In pcregrep, when a pattern that ended with a literal newline sequence was matched in multiline mode, the following line was shown as part of the match. This seems wrong, so I have changed it.

A simple fix would be to add a newline character inside the lookahead expression. That pulls the newline out of the match, so the match no longer ends with a newline and the last line is not shown:

pcregrep -M ".+Exception sending email[\S\s]+?(?=[\r\n]\d{4}(-\d\d){2})" ~/test.log
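
To see what moving the newline into the lookahead changes, here is a rough sketch comparing the two patterns with Python's `re` (a stand-in for PCRE, not pcregrep itself, so it shows only the matched text and not the old printing behaviour). The log is a shortened version of the one in the question, and `LOG`, `ORIGINAL` and `WITH_NEWLINE` are names invented for this example.

```python
import re

# Shortened version of the sample log (headers of the email omitted).
LOG = (
    "2019-03-13 00:33:44,846 [INFO] -:  foo\n"
    "2019-03-13 00:33:45,096 [INFO] -:  Exception sending email\n"
    "Body:\n"
    "some\n"
    "\n"
    "body\n"
    "2019-03-13 00:33:45,190 [INFO] -:  bar\n"
)

ORIGINAL = r".+Exception sending email[\S\s]+?(?=\d{4}(-\d\d){2})"
WITH_NEWLINE = r".+Exception sending email[\S\s]+?(?=[\r\n]\d{4}(-\d\d){2})"

original_match = re.search(ORIGINAL, LOG).group(0)
fixed_match = re.search(WITH_NEWLINE, LOG).group(0)

# The original match ends with the newline before the final date;
# the fixed one stops just before that newline.
print(repr(original_match[-5:]))  # 'body\n'
print(repr(fixed_match[-5:]))     # '\nbody'
```

The only difference between the two matches is that trailing newline, which is the case the changelog entry describes.
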
ttzn