Powershell RegEx - capturing "too much" (not honoring non-Greedy indicators?)

Question

The code below is returning:

partner=<Partner>
 more stuff <Name>Test</Name>
 other things </Partner>  <Partner>
 more stuff <Name>CompanyX</Name>
 other things </Partner>

but I want it to return:

partner=<Partner>
 more stuff <Name>CompanyX</Name>
 other things </Partner>

Sample Code:

$partyName = "CompanyX" 

#$bindings = [IO.File]::ReadAllText($inputFileName)

$bindings = "starting stuff <Partner>`r`n more stuff <Name>Test</Name>`n other things </Partner>  <Partner>`r`n more stuff <Name>CompanyX</Name>`n other things </Partner> ending stuff" 


$found = $bindings -match "(?s)(<Partner>.*?<Name>$partyName</Name>.*?</Partner>)"

if ($found) 
{
    Write-Host "matched"
    $partner = $matches[1]
}

Write-Host "partner=$partner "

Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Maximilian Burszley, Aug 10 '18 at 21:55
In short: Don't parse XML yourself with regex... Use an xml parser. — Maximilian Burszley, Aug 10 '18 at 21:56
Deleting my answer because it was far too fragile. I'm relatively certain someone that's very familiar with `balanced constructs` can give you a reasonable regex...But I suspect even then a manual parsing solution is going to be easier for most people to read. — zzxyz, Aug 10 '18 at 22:19
The basic issue, to summarize, is that as soon as the regex engine sees its first `<Partner>`, it starts working to make THAT match. With *THAT* match, it is honoring the lazy indicator as much as possible. It's basically working left to right, in other words. — zzxyz, Aug 10 '18 at 22:42

score 3 · Answer 1 · answered Aug 11 '18 at 11:49

As TheIncorrigible1 says: Use an xml parser instead of Regex.

However, since your reason for using a regex might simply be to see whether and how it can be done with a regular expression, you can use:

$found = $bindings -match "(?sx)(<Partner>(?:((?!</Partner>).)+<Name>$([Regex]::Escape($partyName))</Name>)(?:((?!</Partner>).))*</Partner>)"
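As a rough check, the pattern above can be run against the sample string from the question. This is only a sketch: it reuses the question's `$partyName` and `$bindings` values and the pattern exactly as posted, and nothing else is assumed.

```powershell
# Sample values copied from the question.
$partyName = 'CompanyX'
$bindings  = "starting stuff <Partner>`r`n more stuff <Name>Test</Name>`n other things </Partner>  <Partner>`r`n more stuff <Name>CompanyX</Name>`n other things </Partner> ending stuff"

# Pattern as posted: (?!</Partner>). consumes one character at a time and
# refuses to step onto a closing </Partner>, so the match cannot span two blocks.
$pattern = "(?sx)(<Partner>(?:((?!</Partner>).)+<Name>$([Regex]::Escape($partyName))</Name>)(?:((?!</Partner>).))*</Partner>)"

$partner = $null
if ($bindings -match $pattern) {
    # Group 1 is the outermost pair of parentheses, i.e. the whole <Partner> block.
    $partner = $Matches[1]
}

Write-Host "partner=$partner"
```

The attempt that starts at the first `<Partner>` fails because it reaches `</Partner>` before finding the CompanyX name, so the engine moves on and succeeds at the second `<Partner>`.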

Theo

Reviewing this years later... it came up as a popular question. My data is not XML. It has substitution tags in it (that look kind of like XML), but the file itself is not well-formed xml. – NealWalters Nov 18 '22 at 20:10

score 0 · Answer 2 · answered Aug 12 '18 at 20:41

The non-greedy quantifiers (.*?) are being honored, but they're not enough in this case:

<Partner>.*?<Name>$partyName</Name> matches from a <Partner> tag to the next instance of the <Name> element of interest, but nothing prevents another <Partner> tag from occurring in between.
In other words: because the regex engine works left to right, your regex will invariably match from the first <Partner> tag through the <Name> element of interest.
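A minimal sketch of that behavior, using a shortened single-line input that is hypothetical (it is not the question's data, only the same shape):

```powershell
# Hypothetical, shortened input with two <Partner> blocks.
$sample = 'a <Partner> x <Name>Test</Name> y </Partner> <Partner> z <Name>CompanyX</Name> w </Partner> b'

# The engine commits to the FIRST <Partner> it finds. From there, .*? grows one
# character at a time until <Name>CompanyX</Name> can match, and it is free to
# walk across the first </Partner> and the second <Partner> on the way.
$lazyMatch = $null
if ($sample -match '(?s)<Partner>.*?<Name>CompanyX</Name>') {
    $lazyMatch = $Matches[0]
}

$lazyMatch
# <Partner> x <Name>Test</Name> y </Partner> <Partner> z <Name>CompanyX</Name>
```

Laziness only decides how far a match extends from its starting point; it does not move the starting point to the right.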

To prevent that, you need a negative look-ahead assertion ((?!...)) that rules out intervening <Partner> tags:

# Sample input, defined as a here-string.
$bindings = @'
starting stuff <Partner>
more stuff <Name>Test</Name>
 other things </Partner> <Partner>
 stuff of interest before <Name>CompanyX</Name>
 stuff of interest after </Partner> even more </Partner> ending stuff
'@ 

# Escape the name to ensure it is treated as a literal inside the regex.
# Note: Not strictly necessary for sample value 'CompanyX'
$partyName = [regex]::Escape('CompanyX')

# Use a negative look-ahead assertion - (?!...) - to rule out intervening
# <Partner> tags before the <Name> element of interest.
if ($bindings -match "(?s)<Partner>((?!<Partner>).)*<Name>$partyName</Name>.*?</Partner>") {
  # Output the match.
  $matches[0]
} else { 
  Write-Warning 'No match.'
}

The above yields:

<Partner>
 stuff of interest before <Name>CompanyX</Name>
 stuff of interest after </Partner>

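(?!<Partner>). matches a single character (.), but only at a position where the string <Partner> does not start (a look-ahead inspects what follows the current position, not what precedes it).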
This subexpression must itself be matched against each character (if any) between the opening <Partner> and the <Name> element of interest, hence it is wrapped in (...)*
- I presume this makes for an inefficient matching algorithm, but it does work.
  As mentioned, using proper XML parsing with an XPath query is worth considering as an alternative.
- You could make this matching more efficient by using (?:...)* as the wrapper, which tells the regex engine not to capture (the latest) match of the subexpression. ((...) are capture groups, meaning that what the subexpression matches is reported as part of what automatic variable $Matches returns, which is not needed here, so ?: suppresses that).
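The non-capturing variant mentioned in the last point could look like the sketch below. It reuses the sample input and variable names from above; only the wrapper changes from `(...)*` to `(?:...)*`.

```powershell
# Same sample input as above.
$bindings = @'
starting stuff <Partner>
more stuff <Name>Test</Name>
 other things </Partner> <Partner>
 stuff of interest before <Name>CompanyX</Name>
 stuff of interest after </Partner> even more </Partner> ending stuff
'@

$partyName = [regex]::Escape('CompanyX')

# (?:...) groups without capturing, so $Matches only holds the overall match (key 0).
$result = $null
if ($bindings -match "(?s)<Partner>(?:(?!<Partner>).)*<Name>$partyName</Name>.*?</Partner>") {
    $result = $Matches[0]
    $groupCount = $Matches.Count   # 1: no capture groups besides the overall match
}

$result
```

With the capturing wrapper, `$Matches` would also contain a key 1 holding the last single character matched by the group, which is of no use here.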
