Match a specific line that is not followed by another specific line before the next occurrence of the first one

Question

I will start with an example, as it might be the easiest explanation. We have a multi-line file:

...
STARTING LINE with something 83
...
STARTING LINE with other 12
...
ENDING LINE with yet another info
...
STARTING LINE with another 43
...

The ... means anything (multiple lines including empty lines) except STARTING LINE .* and ENDING LINE .*.

We have to capture, each in its own group, all STARTING LINE .* lines that are not followed by an ENDING LINE .* before the next STARTING LINE .*. In the example, that means the first and the last occurrence of STARTING LINE .*.

The number of occurrences of STARTING LINE .* alone and STARTING LINE .*...ENDING LINE .* pairs is not known.

I have tried multiple expressions with positive and negative lookaheads and lookbehinds, but never managed to capture the occurrences properly.

I can provide more examples if needed, but it might be hard to give you the expressions I've already tried, as I didn't keep track of them. The current ones capture all occurrences, including the one we don't want:

(^STARTING LINE .*?$)(?!^ENDING LINE)[.\n]+
(^STARTING LINE .*?$(?!.*^ENDING LINE)[.\n]*)

Note that we want to have only the STARTING LINE .* lines in a group.

We use the Python 2.7 regex engine with the re.MULTILINE flag (gm). I also tried adding the re.DOTALL (s) option, with no success.

This is probably a case where regex isn't going to work. Why not loop over the lines and build a list of the lines you want? — darthbith, Oct 29 '18 at 15:02
@darthbith I would do if that would be possible. Unfortunately we use an external tool that lets us do such operations only with regex. — Sebastian Potasiak, Oct 29 '18 at 15:10

Tomasz Linkowski · Accepted Answer · 2018-10-30T05:11:54.270

The following regex works for me in MULTILINE mode (demo):

^STARTING LINE .+$\n(?!(?:(?!(?:STARTING|ENDING) LINE ).+\n)*ENDING LINE )

Explanation:

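^STARTING LINE .+\n: a starting line (the $ is redundant because of the \n)
(?:(?!(?:STARTING|ENDING) LINE ).+\n)*: zero or more middle lines (neither ^ nor $ is needed because of the \n)
ENDING LINE: an ending line (^ is not needed because of the preceding \n)
The middle lines and the ending line sit inside the outer negative lookahead (?!...), so a starting line matches only when no ending line follows before the next starting line.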

PS. This assumes your line feeds are indeed \n, and not \r\n.
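To try the regex from this answer in Python, here is a minimal sketch. The helper name `unpaired_starts`, the sample text, and the CRLF normalization step are assumptions added for illustration. `PATTERN_BLANK_OK` is a hypothetical variant, not part of the answer: the original matches middle lines with `.+\n`, so an empty line stops the scan for an ending line, and the variant uses `.*\n` in case blank lines can appear between a starting and an ending line.

```python
import re

# Regex from the answer above, unchanged.
PATTERN = re.compile(
    r"^STARTING LINE .+$\n(?!(?:(?!(?:STARTING|ENDING) LINE ).+\n)*ENDING LINE )",
    re.MULTILINE,
)

# Hypothetical variant (not from the answer): `.*` instead of `.+` for the
# middle lines, so blank lines are skipped while looking for an ending line.
PATTERN_BLANK_OK = re.compile(
    r"^STARTING LINE .+$\n(?!(?:(?!(?:STARTING|ENDING) LINE ).*\n)*ENDING LINE )",
    re.MULTILINE,
)


def unpaired_starts(text, pattern=PATTERN):
    """Return starting lines with no ending line before the next starting line."""
    # Normalize CRLF so the `\n`-based pattern only sees plain line feeds.
    text = text.replace("\r\n", "\n")
    # Every match includes the trailing "\n"; strip it for readability.
    return [m.group(0).rstrip("\n") for m in pattern.finditer(text)]


# Sample input: the "..." lines of the question replaced by placeholder text.
sample = (
    "header\n"
    "STARTING LINE with something 83\n"
    "foo\n"
    "STARTING LINE with other 12\n"
    "bar\n"
    "ENDING LINE with yet another info\n"
    "baz\n"
    "STARTING LINE with another 43\n"
    "tail\n"
)

print(unpaired_starts(sample))
```

With this sample, the list holds the first and the last starting line. Note that the pattern requires a line feed after the starting line, so a starting line at the very end of the input without a trailing newline is not matched.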

The fourth bird · Answer 2 · 2018-10-29T21:41:39.527

You could match from STARTING LINE up to a newline followed by the next STARTING LINE, using a positive lookahead and refusing to cross an ENDING LINE on the way. That way you know another STARTING LINE follows before any ENDING LINE.

For the last match, a negative lookahead can check that no newline followed by ENDING LINE occurs anywhere in the rest of the text.

^STARTING LINE(?:.*(?:(?!\n(STARTING|ENDING) LINE)\n.*)*(?=\nSTARTING LINE)|(?![\s\S]*\nENDING LINE)[\s\S]*$)

Regex demo

Explanation

^ Start of the line
STARTING LINE Match literally
(?: Start non capturing group
- .* Match 0+ characters
- (?: Non capturing group
  - (?! Negative lookahead to assert what is on the right side is not
    - \n(STARTING|ENDING) LINE Match newline followed by STARTING LINE or ENDING LINE
  - ) Close negative lookahead
  - \n.* Match a newline and 0+ characters
- )* Close non capturing group and repeat 0+ times
- (?= Positive lookahead to assert what is on the right side is
  - \nSTARTING LINE Match newline followed by STARTING LINE
- ) Close lookahead
- | Or
- (?! Start negative lookahead
  - [\s\S]*\nENDING LINE Match any character including line break characters 0+ times followed by a newline and ENDING LINE
- ) Close negative lookahead
- [\s\S]*$ Match any character including line break characters 0+ times until the end of the string
) Close non capturing group
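A minimal Python sketch of how this pattern might be used. Unlike a pattern that matches only the starting line, each match here spans the starting line plus the lines that follow it, so the sketch keeps just the first line of every match. The helper name `unpaired_starts` and the sample text are assumptions for illustration.

```python
import re

# Regex from the answer above, unchanged. It contains one capturing group,
# so re.findall would return that group rather than the full match;
# finditer with group(0) is used instead.
PATTERN = re.compile(
    r"^STARTING LINE(?:.*(?:(?!\n(STARTING|ENDING) LINE)\n.*)*(?=\nSTARTING LINE)"
    r"|(?![\s\S]*\nENDING LINE)[\s\S]*$)",
    re.MULTILINE,
)


def unpaired_starts(text):
    """Return the first line of every match, i.e. the starting line itself."""
    return [m.group(0).split("\n", 1)[0] for m in PATTERN.finditer(text)]


# Sample input: the "..." lines of the question replaced by placeholder text.
sample = (
    "header\n"
    "STARTING LINE with something 83\n"
    "foo\n"
    "STARTING LINE with other 12\n"
    "bar\n"
    "ENDING LINE with yet another info\n"
    "baz\n"
    "STARTING LINE with another 43\n"
    "tail\n"
)

print(unpaired_starts(sample))
```

If the external tool only exposes capture groups, the starting line would need its own group; that adjustment is not shown here.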

score 0 · Answer 3 · answered Oct 29 '18 at 15:18

I am afraid you need to solve this by streaming over the lines, not with a single regex. If it helps, here is an awk solution:

$ awk '/^STARTING LINE / { if (startingline != "") print startingline; startingline = $0 } /^ENDING LINE / { startingline = "" } END { if (startingline != "") print startingline }' file.txt
STARTING LINE with something 83
STARTING LINE with another 43
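For readers who are not restricted to a regex-only tool, the same streaming idea can be sketched in Python. This is an illustrative equivalent under stated assumptions, not part of the answer: the function name `unpaired_starts` and the `pending` variable are hypothetical, and lines are assumed to begin with the literal prefixes `STARTING LINE ` and `ENDING LINE `.

```python
def unpaired_starts(lines):
    """Yield each starting line not closed by an ending line before the next start."""
    pending = None  # most recent starting line still waiting for an ending line
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("STARTING LINE "):
            if pending is not None:
                yield pending  # the previous start was never closed
            pending = line  # always remember the newest start
        elif line.startswith("ENDING LINE "):
            pending = None  # the pending start is paired; drop it
    if pending is not None:
        yield pending  # the last start reached end of input unpaired


# Sample input: the "..." lines of the question replaced by placeholder text.
sample = [
    "header\n",
    "STARTING LINE with something 83\n",
    "foo\n",
    "STARTING LINE with other 12\n",
    "bar\n",
    "ENDING LINE with yet another info\n",
    "baz\n",
    "STARTING LINE with another 43\n",
    "tail\n",
]

print(list(unpaired_starts(sample)))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, which keeps memory use flat on large inputs.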
