
For example, suppose a file contains the following:

START OF NEW LOG ENTRY
first line
second line KEYWORD
third line
START OF NEW LOG ENTRY
first line
second line
third line
etc... (this file goes on in this manner for a long time)
...

I need to extract all lines of each log entry that contains the word "KEYWORD". The regex I am using for this (with pcregrep) is as follows:

pcregrep -Mo "(?s)(?:^START OF NEW LOG ENTRY)(?:.(?!^START OF NEW LOG ENTRY))*?(?:KEYWORD).*?(?=\nSTART OF NEW LOG ENTRY|\Z)" file

Now this works just fine, and prints the following as expected:

START OF NEW LOG ENTRY
first line
second line KEYWORD
third line

So what's wrong? As I understand it, after matching that log entry (lines 1-4 of the file), the regex engine starts its next match attempt at line 2. It therefore needlessly re-traverses characters it has already matched before it reaches the start of the second log entry, which seems like a waste of time. It should instead carry on where the last match ended, i.e. at line 5.

I thought that placing \G at the beginning of my regex (after the (?s)) would solve this, but it doesn't.

Does anyone have any smart ideas?

Alan Moore
Lost Crotchet

1 Answer


Using -C0 instead of -o works for me. I confirmed the problem using this modified input:

START OF NEW LOG ENTRY
START
first line
second line KEYWORD
third line
START OF NEW LOG ENTRY
first line
second line
third line
etc... (this file goes on in this manner for a long time)
...

...and this regex:

(?s)^START.*?KEYWORD(?:(?!^START).)*

Using the options -oM, I got this result:

START OF NEW LOG ENTRY
START
first line
second line KEYWORD
third line

START
first line
second line KEYWORD
third line

...confirming that the second match attempt starts on the second line instead of after the last line of the match. With the options -C0 -M, I get just one hit, as desired:

START OF NEW LOG ENTRY
START
first line
second line KEYWORD
third line

-o prints only what's matched instead of the whole line plus context. But it also allows multiple matches per line, and I'm guessing that's the source of the problem. Your regex matches whole lines anyway, so all you need to do is suppress the context.
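For comparison, here is a minimal sketch in Python's re module (not PCRE, and not pcregrep) of the same test pattern. It assumes an input with a bare "START" as its second line, as the output above implies, and sets re.DOTALL and re.MULTILINE to stand in for (?s) and for ^ matching at line starts. An ordinary non-overlapping scan resumes where the previous match ended and yields one hit. The second search, which restarts from the second line, is only a hypothetical emulation of the behaviour observed with -o, not a description of pcregrep's internals.

```python
import re

# Input with a bare "START" on the second line (assumed from the output above).
text = (
    "START OF NEW LOG ENTRY\n"
    "START\n"
    "first line\n"
    "second line KEYWORD\n"
    "third line\n"
    "START OF NEW LOG ENTRY\n"
    "first line\n"
    "second line\n"
    "third line\n"
)

# DOTALL stands in for (?s); MULTILINE lets ^ match at every line start.
pattern = re.compile(r"^START.*?KEYWORD(?:(?!^START).)*", re.DOTALL | re.MULTILINE)

# finditer resumes each search at the end of the previous match.
matches = [m.group() for m in pattern.finditer(text)]
print(len(matches))

# Hypothetical emulation of the -o result: search again from the second line.
second_line = text.index("\n") + 1
rescanned = pattern.search(text, second_line).group()
print(rescanned)
```

The scan that resumes at the end of the match finds a single entry, while the restart from the second line produces the shorter, overlapping hit that begins with the bare "START".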

Here's the actual regex I would use:

(?s)^START OF NEW LOG ENTRY(?:(?!^START OF NEW LOG ENTRY|\bKEYWORD\b).)*+\bKEYWORD\b(?:(?!^START OF NEW LOG ENTRY).)*$

It's a bit more efficient, and it corrects an error in the tempered greedy token: the dot has to go after the lookahead, not before.
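The effect of the dot's position can be seen in a small sketch, again using Python's re module as a stand-in for PCRE. The two-line sample text and the variable names are assumptions made for illustration. With the dot first, the token consumes a character and only then checks what follows, so it stops one character short of the delimiter and can also step into a delimiter that begins at the current position.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE
text = "START one\nSTART two\n"  # hypothetical two-entry sample

# Dot before the lookahead: consume a character, then check what follows.
dot_first = re.compile(r"^START(?:.(?!^START))*", FLAGS)
# Dot after the lookahead: check first, then consume.
dot_last = re.compile(r"^START(?:(?!^START).)*", FLAGS)

short = dot_first.match(text).group()  # stops before the newline
full = dot_last.match(text).group()    # includes the newline

# Without the leading literal, the token itself starts on the delimiter.
overrun = re.compile(r"(?:.(?!^START))*", FLAGS).match(text).group()
stopped = re.compile(r"(?:(?!^START).)*", FLAGS).match(text).group()

print(repr(short), repr(full), repr(overrun), repr(stopped))
```

In this sketch the dot-first form leaves the final newline of the entry unmatched and walks straight into a delimiter at the starting position, whereas the lookahead-first form consumes everything up to the delimiter and refuses to enter it.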

Community
Alan Moore