
My regular expression:

(?si)\bStart\b(.*?)\bError\b(.*?)\bEnd\b

That works for scenarios like:

stuff happens  
Start  
stuff happens  
Error  
stuff happens  
End

But it also matches when Error falls outside a Start...End sequence:

Start  
End  
Error  
Start  
End

How can I match only hits like the first example when the input looks like scenario #2?
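To make the problem reproducible, here is a minimal Python sketch. Python is used only as a convenient harness, and the sample strings are the two scenarios above with trailing whitespace removed (an assumption about the real input). It shows the original pattern matching both inputs, because the lazy `.*?` is free to cross an End and a second Start.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"(?si)\bStart\b(.*?)\bError\b(.*?)\bEnd\b")

scenario_1 = "stuff happens\nStart\nstuff happens\nError\nstuff happens\nEnd"
scenario_2 = "Start\nEnd\nError\nStart\nEnd"

wanted = ORIGINAL.search(scenario_1)
unwanted = ORIGINAL.search(scenario_2)

# Scenario 1: a genuine Start ... Error ... End block.
print(repr(wanted.group(0)))
# Scenario 2: the match runs from the first Start to the last End,
# swallowing the End and the Start that sit between them.
print(repr(unwanted.group(0)))
```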

kayleeFrye_onDeck

3 Answers


PowerShell, using negative lookahead and assuming that the "stuff happens" parts never contain the words "start" or "end":

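# Read the whole file as a single string so the pattern can span lines
$txt = Get-Content file.txt | Out-String
# Guarded dots: the first gap may not cross End, the second may not cross Start
$pattern = "(?si)\bStart\b((?!\bEnd\b).)*?\bError\b((?!\bStart\b).)*?\bEnd\b"
$regex = New-Object System.Text.RegularExpressions.Regex($pattern)
# Return every Start...Error...End block found in the text
$regex.Matches($txt)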

Explained here.
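As a quick check of the same pattern outside PowerShell, the following Python sketch runs it over the two scenarios from the question. The sample strings and variable names are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as in the PowerShell snippet: each "." is guarded by a
# negative lookahead, so the first gap cannot cross End and the second
# gap cannot cross Start.
GUARDED = re.compile(
    r"(?si)\bStart\b((?!\bEnd\b).)*?\bError\b((?!\bStart\b).)*?\bEnd\b"
)

scenario_1 = "stuff happens\nStart\nstuff happens\nError\nstuff happens\nEnd"
scenario_2 = "Start\nEnd\nError\nStart\nEnd"

hits_1 = [m.group(0) for m in GUARDED.finditer(scenario_1)]
hits_2 = [m.group(0) for m in GUARDED.finditer(scenario_2)]

print(hits_1)  # the single genuine block
print(hits_2)  # empty: the stray Error sits between an End and a Start
```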

Alexander Obersht

Alexander's answer is probably good enough, but I would do it like this:

(?si)\bStart\b(?:(?!\b(?:Start|End)\b).)*\bError\b(?:(?!\b(?:Start|End)\b).)*\bEnd\b

The main advantage of this regex is that it fails more quickly. ((?!\bStart\b).)*? works fine if there is an End where you expect one, but if no match is possible, it still has to go all the way to the next Start (if there is one) or to the end of the document before it can give up on the match.
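The speed difference is hard to show in a few lines, but a short Python sketch can show how the two patterns relate in what they match. Under the assumed sample inputs below, both agree on the question's two scenarios. They differ when a Start is followed by another Start before any Error: guarding both gaps against both keywords reports only the innermost block, while the earlier pattern lets its first gap run across the second Start.

```python
import re

# Pattern from the earlier answer: first gap avoids End, second avoids Start.
EARLIER = re.compile(
    r"(?si)\bStart\b((?!\bEnd\b).)*?\bError\b((?!\bStart\b).)*?\bEnd\b"
)
# Pattern from this answer: both gaps avoid both Start and End.
STRICTER = re.compile(
    r"(?si)\bStart\b(?:(?!\b(?:Start|End)\b).)*\bError\b"
    r"(?:(?!\b(?:Start|End)\b).)*\bEnd\b"
)

scenario_1 = "stuff happens\nStart\nstuff happens\nError\nstuff happens\nEnd"
scenario_2 = "Start\nEnd\nError\nStart\nEnd"
repeated_start = "Start Start Error End"  # hypothetical input

same_on_1 = EARLIER.search(scenario_1).group(0) == STRICTER.search(scenario_1).group(0)
both_reject_2 = EARLIER.search(scenario_2) is None and STRICTER.search(scenario_2) is None

earlier_hit = EARLIER.search(repeated_start).group(0)
stricter_hit = STRICTER.search(repeated_start).group(0)
print(repr(earlier_hit))   # begins at the first Start
print(repr(stricter_hit))  # begins at the second Start
```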

In fact, you can take it a step further and eliminate backtracking entirely:

(?si)\bStart\b(?>(?:(?!\b(?:Start|End|Error)\b).)*)\bError\b(?>(?:(?!\b(?:Start|End|Error)\b).)*)\bEnd\b

Adding an Error alternative and enclosing that part in an atomic group means that if it finds a Start and doesn't find an Error before the next End, it fails immediately.

Here's a PowerShell example (as generated by RegexBuddy):

$regex = [regex] '(?si)\bStart\b(?>(?:(?!\b(?:Start|End|Error)\b).)*)\bError\b(?>(?:(?!\b(?:Start|End|Error)\b).)*)\bEnd\b'
$matchdetails = $regex.Match($subject)
while ($matchdetails.Success) {
    # matched text: $matchdetails.Value
    # match start: $matchdetails.Index
    # match length: $matchdetails.Length
    $matchdetails = $matchdetails.NextMatch()
}

UPDATE: I just realized that I shouldn't have added the Error branch to the second alternation. My regex matches only those Start..End blocks that contain Error exactly once, which is probably too specific. This version matches a block with at least one occurrence of Error in it:

(?si)\bStart\b(?>(?:(?!\b(?:Start|End|Error)\b).)*)\bError\b(?>(?:(?!\b(?:Start|End)\b).)*)\bEnd\b
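The difference between the "exactly once" and "at least once" versions can be sketched in Python as follows. The sample strings are assumptions. Atomic groups are only available in Python's `re` module from version 3.11, so the sketch falls back to a plain non-capturing group on older versions; for these particular patterns that changes how much the engine backtracks, not which blocks are found.

```python
import re
import sys

# "(?>...)" needs Python 3.11+; older versions get "(?:...)" instead.
G = "(?>" if sys.version_info >= (3, 11) else "(?:"

# Error is excluded from both gaps: blocks containing exactly one Error.
EXACTLY_ONCE = re.compile(
    r"(?si)\bStart\b" + G + r"(?:(?!\b(?:Start|End|Error)\b).)*)\bError\b"
    + G + r"(?:(?!\b(?:Start|End|Error)\b).)*)\bEnd\b"
)
# Error is excluded from the first gap only: blocks with at least one Error.
AT_LEAST_ONCE = re.compile(
    r"(?si)\bStart\b" + G + r"(?:(?!\b(?:Start|End|Error)\b).)*)\bError\b"
    + G + r"(?:(?!\b(?:Start|End)\b).)*)\bEnd\b"
)

one_error = "Start a Error b End"
two_errors = "Start a Error b Error c End"
no_error = "Start a End"

for name, text in [("one", one_error), ("two", two_errors), ("none", no_error)]:
    print(name,
          EXACTLY_ONCE.search(text) is not None,
          AT_LEAST_ONCE.search(text) is not None)
```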
Alan Moore

Alright, so I came back to this once I was able to work through what the accepted answer does piece by piece, which is much easier for me to follow than everything on one line at once. This alternative answer walks through the process from start to finish for the original question's goal: matching a three-part pattern in order while making sure undesirable matches don't occur.

Step 1: Get your pattern working before adding exclusions

\bStart\b.*\bError\b.*\bEnd\b

Step 2: Wrap each . in a non-capturing group. For now (?:.) is just a placeholder that matches any single character, so it doesn't break the pattern we already established.

\bStart\b(?:.)*\bError\b(?:.)*\bEnd\b
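As a sanity check on these first two steps, this Python sketch (sample strings are assumptions) confirms that wrapping `.` in `(?:.)` changes nothing. It also shows a detail worth knowing at this stage: without the `s` flag, `.` does not match a newline, so these early patterns only find blocks that sit on a single line.

```python
import re

STEP_1 = re.compile(r"\bStart\b.*\bError\b.*\bEnd\b")
STEP_2 = re.compile(r"\bStart\b(?:.)*\bError\b(?:.)*\bEnd\b")

one_line = "noise Start stuff Error stuff End noise"
multi_line = "Start\nstuff\nError\nstuff\nEnd"

step_1_hit = STEP_1.search(one_line).group(0)
step_2_hit = STEP_2.search(one_line).group(0)
print(repr(step_1_hit))
print(repr(step_2_hit))

# Neither step matches across line breaks yet.
print(STEP_1.search(multi_line), STEP_2.search(multi_line))
```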

Step 3: Now we want to enclose those non-capturing groups in an atomic group, (?>...), and put a negative lookahead inside each non-capturing group, so that a gap stops as soon as it reaches Start or End (and, in the first gap, Error) and the match can fail early. We can't really break this piece down further without breaking the minimal matching functionality.

\bStart\b(?>(?:(?!\b(Start|End|Error)\b).))*\bError\b(?>(?:(?!\b(Start|End)\b).))*\bEnd\b
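To see what one of these guarded gaps does on its own, here is a small Python sketch with assumed sample text. It uses the gap without the atomic wrapper so it runs on any Python 3 version. One observation about the pattern above: the `*` sits outside `(?>...)`, so each atomic group wraps only a single character and cannot stop backtracking across the whole gap; the accepted answer places the `*` inside the atomic group. The blocks that match are the same either way.

```python
import re

# One "gap" taken in isolation: consume characters one at a time, but only
# while no keyword starts at the current position.
GAP = re.compile(r"(?:(?!\b(?:Start|End|Error)\b).)*")

text = " stuff happens Error more stuff End"
consumed = GAP.match(text).group(0)
print(repr(consumed))  # stops right before the keyword Error

# Words that merely begin with a keyword do not stop the gap,
# because \b requires a whole-word match.
lookalike_text = " Errors and Endings "
lookalike = GAP.match(lookalike_text).group(0)
print(repr(lookalike))
```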

Step 4: Now, just add the inline flags (?si) at the start, where s lets . match newlines and i makes the match case-insensitive, and you're good to go!

(?si)\bStart\b(?>(?:(?!\b(Start|End|Error)\b).))*\bError\b(?>(?:(?!\b(Start|End)\b).))*\bEnd\b
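The effect of the two flags can be sketched in Python as below. For portability the sketch uses the guarded pattern without the atomic wrappers (an assumption made so it runs on Python versions before 3.11); the sample strings are also assumptions.

```python
import re

# Guarded pattern body, without flags and without atomic groups.
BODY = (
    r"\bStart\b(?:(?!\b(?:Start|End|Error)\b).)*\bError\b"
    r"(?:(?!\b(?:Start|End)\b).)*\bEnd\b"
)

NO_FLAGS = re.compile(BODY)
DOTALL_ONLY = re.compile("(?s)" + BODY)
BOTH_FLAGS = re.compile("(?si)" + BODY)

multi_line = "Start\nstuff\nError\nstuff\nEnd"
mixed_case = "start\nstuff\nERROR\nstuff\nend"

print(NO_FLAGS.search(multi_line))                  # "." cannot cross newlines
print(DOTALL_ONLY.search(multi_line) is not None)   # s fixes that
print(DOTALL_ONLY.search(mixed_case))               # still case-sensitive
print(BOTH_FLAGS.search(mixed_case) is not None)    # i fixes that
```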

I'm a highly visual learner, so I'm sharing a graphic that personally helped me break it down.

[Image: visual breakdown of the regular expression]

kayleeFrye_onDeck