For example, suppose a file contains the following:
START OF NEW LOG ENTRY
first line
second line KEYWORD
third line
START OF NEW LOG ENTRY
first line
second line
third line
etc... (this file goes on in this manner for a long time)
...
I need to extract every log entry (all of its lines) that contains the keyword "KEYWORD". The regex I use for this (with pcregrep) is as follows:
pcregrep -Mo "(?s)(?:^START OF NEW LOG ENTRY)(?:.(?!^START OF NEW LOG ENTRY))*?(?:KEYWORD).*?(?=\nSTART OF NEW LOG ENTRY|\Z)" file
Now this works just fine, and prints the following as expected:
START OF NEW LOG ENTRY
first line
second line KEYWORD
third line
So what's wrong? Well, my understanding of how regex works is that after matching that log entry (lines 1-4), the regex engine starts trying to match again from line 2. If so, the engine needlessly traverses 2 lines worth of characters before it reaches the start of the 2nd log entry, which seems like a waste of time. Instead, it should just carry on from where the last match ended, i.e. line 5.
I thought that placing \G at the beginning of my regex (after the (?s)) would solve this, but it doesn't.
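To make the behaviour I am after concrete, here is a minimal sketch in Python rather than pcregrep, so it says nothing about how pcregrep itself scans the file. It assumes Python's re module with the MULTILINE and DOTALL flags is a close enough stand-in for pcregrep -M with (?s), and it explicitly resumes each search at the end of the previous match. The names entries_with_keyword and SAMPLE are made up for illustration.

```python
import re

# Same pattern as above, split for readability. MULTILINE makes ^ match at
# line starts and DOTALL makes . match newlines (assumed equivalent to
# pcregrep -M with (?s)).
PATTERN = re.compile(
    r"^START OF NEW LOG ENTRY"
    r"(?:.(?!^START OF NEW LOG ENTRY))*?"
    r"KEYWORD"
    r".*?(?=\nSTART OF NEW LOG ENTRY|\Z)",
    re.MULTILINE | re.DOTALL,
)

# Hypothetical sample: entries 1 and 3 contain KEYWORD, entry 2 does not.
SAMPLE = "\n".join([
    "START OF NEW LOG ENTRY", "first line", "second line KEYWORD", "third line",
    "START OF NEW LOG ENTRY", "first line", "second line", "third line",
    "START OF NEW LOG ENTRY", "first line", "second line KEYWORD", "third line",
])


def entries_with_keyword(text):
    """Return (start, end, entry) tuples for each entry containing KEYWORD."""
    results = []
    pos = 0
    while True:
        # Search from the end of the previous match, not from its second line.
        match = PATTERN.search(text, pos)
        if match is None:
            break
        results.append((match.start(), match.end(), match.group(0)))
        # A match always includes the header, so it is never empty and
        # pos strictly advances.
        pos = match.end()
    return results


if __name__ == "__main__":
    for start, end, entry in entries_with_keyword(SAMPLE):
        print(f"--- match at offsets {start}..{end} ---")
        print(entry)
```

Each search here begins at the offset where the previous match ended, which is what I would like pcregrep to do on the real file.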
Does anyone have any smart ideas?