Output Substring to Newline from a Raw Text String using Regex

Question

I have a name delimiter that I want to use to extract the whole line where it is found.

[string]$testString = $null

# broken text string of text & newlines which simulates $testString = Get-Content -Raw

$testString = "initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text"

# test1
# simulate text string before(?<content>.*)text string after - this returns "initial text" only (no newline or anything after)
# $testString -match "(?<BOURKE>.*)"

# test2
# this returns all text, including the newlines, so that $testString outputs exactly as it is defined 
$testString -match "(?s)(?<BOURKE>.*)"

#test3
# I want just the line with BOURKE

$result = $matches['BOURKE']

$result

#Test1 finds the match but only prints to the newline. #Test2 finds the match and includes all newlines. I would like to know what is the regex pattern that forces the output to begin 001 BOURKE ...

Any suggestions would be appreciated.

Your desired output is not quite clear, are you looking for the entire line starting from `001` until the end of the line? — Santiago Squarzon, Mar 09 '22 at 02:01
I think you want `$testString -match '(?m)(?<BOURKE>.*BOURKE.*?)(?=\r?\n)';$Matches['BOURKE']` — Mathias R. Jessen, Mar 09 '22 at 02:03
As an aside: `(?<BOURKE>...)` simply gives the capture group (`(...)`) a self-chosen _name_ (`BOURKE`), which is unrelated to what the capture group's subexpression (`...`) matches. — mklement0, Mar 09 '22 at 02:40
@mklement0 I think I understand your aside. Except BOURKE matches the characters BOURKE? Multi line modifier, Named Capture Group BOURKE, matches all except line terminators, matches as many as possible (greedy), BOURKE matches the characters BOURKE, positive look ahead (?=\r?\n) (Information https://regex101.com/) — Dave, Mar 09 '22 at 04:12
@Dave, I was referring to the regexes _in your question_, namely `(?<BOURKE>.*)` and `(?s)(?<BOURKE>.*)`, neither of which look for a substring `BOURKE`. Also, while there's no harm in using a (named) capture group, in your simple case you don't even need _any_ capture group, as my updated answer shows. — mklement0, Mar 09 '22 at 04:19

score 2 · Answer 1 · edited Mar 10 '22 at 13:07

I find it best to have the match consume everything up to the part that is not needed, here the newline (\r\n). A negated character set does that: [^\r\n]+ matches one or more characters that are neither \r nor \n, so the match stops just before the line ending.

To do that use

$testString -match "(?<Bourke>\d\d\d\s[^\r\n]+)"

Also, try to avoid * when you know there will be text to match. The * quantifier is greedy and means zero or more, so the regex engine also has to consider the zero-length case, and with it backtracking paths that are plainly not plausible. Using + (one or more) rules those paths out and narrows the match considerably.
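A minimal sketch of the pattern above, run against a made-up string with explicit CRLF newlines (the `$sample` text is an assumption for illustration, not the asker's real data):

```powershell
# Hypothetical sample with Windows-style CRLF line endings.
$sample = "preliminary text`r`n001 BOURKE, Bridget Mary`r`nline after"

# Three digits, one whitespace character, then everything up to (not including) CR or LF.
if ($sample -match '\d\d\d\s[^\r\n]+') {
    $line = $Matches[0]
    $line            # 001 BOURKE, Bridget Mary
    $line.Length     # 24, so no trailing CR was captured
}
```

Note that this matches the first run of three digits plus whitespace wherever it occurs on a line, not specifically a line containing BOURKE.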

answered Mar 09 '22 at 02:08 by ΩmegaMan · edited Mar 10 '22 at 13:07 by mklement0

While there's good advice in your answer, I don't think it answers Dave's question as asked (his accepting your answer notwithstanding): The question suggests that a line on which `BOURKE` appears _as a substring_ should be reported _as a whole_. By contrast, your solution looks for 3 digits followed by a whitespace character _anywhere on a line_, through the end of that line. – mklement0 Mar 09 '22 at 23:06
@mklement0 I did focus on the three digits as an anchor, true. So if `BOURKE` (or anything else) is needed as pre-requisite simply add it to the pre-pattern anchor such as `\d\d\d\sBOURKE[^\r\n]+`. – ΩmegaMan Mar 09 '22 at 23:09
I don't think there's an additional constraint (the 3 digits are incidental), so `.*BOURKE.*` will do for LF-only multiline strings, and `.*BOURKE[^\r\n]*` for CRLF ones. – mklement0 Mar 09 '22 at 23:11
if you are insisting on using `*` ...then lessen its greedy-ness by `[\r\n]*?` so it doesn't capture extra lines as per the request of the OP to only get one line. – ΩmegaMan Mar 09 '22 at 23:15
If there _were_ an additional constraint with respect to how the line _starts_, you'd need something like `(?m)^\d\d\d\s.*BOURKE[^\r\n]*` - note that I'm using `*`, as using `+` would impose additional - possibly undesired - constraints. – mklement0 Mar 09 '22 at 23:15
No, `*?` isn't needed in this case. – mklement0 Mar 09 '22 at 23:16
Looking at this conversation I tried to incorporate the consensus solution: `\d\d\d\sBOURKE[^\r\n]*` On reading your comments, I apologize if my simplification (omitting `001`) has led to a misunderstanding. The `\d\d\d` combination is relevant. Furthermore, I had simplified my problem for the post. Where, my original code, has `[string]$refName1 = "001 BOURKE"` So I think the final line being discussed here is now: `$testString -match "([regex]::Escape($refName1)[^\r\n]*)"` Which is working and I hope will do the job. – Dave Mar 10 '22 at 04:42
Thank you mklement0 & ΩmegaMan for your efforts. – Dave Mar 10 '22 at 04:49
Thanks, @Dave, but there's still ambiguity (and your expression won't work as written): If you want your literal substring to only match _at the start of a line_, use `$testString -match ('(?m)^{0}[^\r\n]*' -f [regex]::Escape('001 BOURKE'))` (if you know other chars. follow, you can use `+` in lieu of `*`). If it should match _anywhere on a line_ - while still capturing the _whole line_: `$testString -match ('.*{0}[^\r\n]*' -f [regex]::Escape('001 BOURKE'))` – mklement0 Mar 10 '22 at 13:46
Taking a step back, @Dave: There's what you _actually needed_, which is different from _what you asked for_ in your question. What future readers benefit from are answers that solve the problem _as asked_. Revealing the true requirements later invalidates answers that have already been given, which is why it is important to communicate all requirements up front. Similarly, accepting answers that do not clearly address the question as asked is likely to cause confusion. – mklement0 Mar 10 '22 at 13:56
I accept the points you make @mklement0 – Dave Mar 10 '22 at 18:42
I find that mostly *straw-man* type arguments were presented, with those being easy to knock down to say "See, I told you so". The strength of my pattern is that it can be built upon, expanded as such, and avoids the backtracking pitfalls of other patterns. – ΩmegaMan Mar 10 '22 at 18:47

mklement0 · Accepted Answer · 2022-03-10T13:33:11.720

Note:

I'm assuming you're looking for the whole line on which BOURKE appears as a substring.
In your own attempts, (?<BOURKE>...) simply gives the regex capture group a self-chosen name (BOURKE), which is unrelated to what the capture group's subexpression (...) actually matches.
For the use case at hand, there's no strict need to use a (named) capture group at all, so the solutions below make do without one, which, when the -match operator is used, means that the result of a successful match is reported in index [0] of the automatic $Matches variable, as shown below.
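To make the point about group names concrete, here is a small sketch using a made-up LF-only string (`$text` and the variable names are assumptions for illustration):

```powershell
# Hypothetical LF-only sample.
$text = "first line`n001 BOURKE, Bridget`nlast line"

# The group is merely *named* BOURKE; its subexpression .* matches anything,
# so the capture is simply the first line.
$null = $text -match '(?<BOURKE>.*)'
$byName = $Matches['BOURKE']
$byName     # first line

# No capture group at all: the whole match is reported in $Matches[0].
$null = $text -match '.*BOURKE.*'
$whole = $Matches[0]
$whole      # 001 BOURKE, Bridget
```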

If your multiline input string contains only Unix-format LF newlines (\n), use the following:

if ($multiLineStr -match '.*BOURKE.*') { $Matches[0] }

Note:

To match case-sensitively, use -cmatch instead of -match.
If you know that the substring is preceded / followed by at least one char., use .+ instead of .*
If you want to search for the substring verbatim and it contains or may contain regex metacharacters (e.g. . ), apply [regex]::Escape() to it; e.g., [regex]::Escape('file.txt') yields file\.txt (\-escaped metacharacters).
If necessary, add additional constraints for disambiguation, such as requiring that the substring start or end only at word boundaries (\b)
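The notes above can be sketched as follows. The sample text and the `$needle` value are made up for illustration:

```powershell
# Hypothetical sample: the first line contains 'file-txt', the second 'file.txt'.
$text   = "see file-txt here`nopen file.txt now`nlast line"
$needle = 'file.txt'

# Unescaped, the . in the needle matches any character, so the first line matches.
$null = $text -match ('.*{0}.*' -f $needle)
$unescaped = $Matches[0]
$unescaped   # see file-txt here

# Escaped, the needle is matched verbatim.
$null = $text -match ('.*{0}.*' -f [regex]::Escape($needle))
$escaped = $Matches[0]
$escaped     # open file.txt now

# -match is case-insensitive; -cmatch is case-sensitive.
$ci = '002 Bourke, David' -match  'BOURKE'     # True
$cs = '002 Bourke, David' -cmatch 'BOURKE'     # False

# \b requires word boundaries around the substring.
$wb = 'THE BOURKES' -match '\bBOURKE\b'        # False
```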

If there's a chance that Windows-format CRLF newlines (\r\n) are present, use:

if ($multiLineStr -match '.*BOURKE[^\r\n]*') { $Matches[0] }

For an explanation of the regexes and the ability to experiment with them, see this regex101.com page for .*BOURKE.*, and this one for .*BOURKE[^\r\n]*

In short:

By default, . matches any character except \n, which obviates the need for line-specific anchors (^ and $) altogether, but with CRLF newlines requires excluding \r so as not to capture it as part of the match. [1]
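The difference is easy to see on a string with explicit CRLF newlines. This is a minimal sketch; `$crlf` is a made-up sample, and the character-code check is just one way to inspect the last character:

```powershell
# Hypothetical sample with CRLF line endings.
$crlf = "before`r`n001 BOURKE, Bridget`r`nafter"

# . matches \r, so the trailing CR becomes part of the match.
$null = $crlf -match '.*BOURKE.*'
$withDot = $Matches[0]
[int] $withDot[-1]      # 13, the character code of CR

# Excluding both \r and \n stops the match before the line ending.
$null = $crlf -match '.*BOURKE[^\r\n]*'
$withSet = $Matches[0]
$withSet                # 001 BOURKE, Bridget
$withDot.Length - $withSet.Length   # 1
```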

Two asides:

PowerShell's -match operator only ever looks for one match; if you need to find all matches, you currently need to use the underlying [regex] API directly; e.g., [regex]::Matches($multiLineStr, '.*BOURKE[^\r\n]*', 'IgnoreCase').Value
GitHub issue #7867 suggests bringing this functionality directly to PowerShell in the form of a -matchall operator.
If you want to anchor the substring to find, i.e. if you want to stipulate that it either occur at the start or at the end of a line, you need to switch to multi-line mode ((?m)), which makes ^ and $ match on each line; e.g., to only match if BOURKE occurs at the very start of a line:
- if ($multiLineStr -match '(?m)^BOURKE[^\r\n]*') { $Matches[0] }
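Both asides can be tried out with a short sketch like the following, where `$text` is a made-up CRLF sample with several candidate lines:

```powershell
# Hypothetical sample: three lines mention BOURKE in some casing, one starts with it.
$text = "001 BOURKE, Bridget`r`nsome other line`r`n002 Bourke, David`r`nBOURKE at line start"

# All matching lines, case-insensitively, via the .NET regex API.
$all = [regex]::Matches($text, '.*BOURKE[^\r\n]*', 'IgnoreCase').Value
$all.Count    # 3

# Multi-line mode: ^ matches at the start of every line, so only a line
# that begins with BOURKE qualifies.
$null = $text -match '(?m)^BOURKE[^\r\n]*'
$anchored = $Matches[0]
$anchored     # BOURKE at line start
```

Unlike -match, [regex]::Matches() is case-sensitive by default, which is why the 'IgnoreCase' option is passed explicitly here.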

If line-by-line processing is an option:

Line-by-line processing has the advantage that you needn't worry about differences in newline formats (assuming the utility handling the splitting into lines can handle both newline formats, which is true of PowerShell in general).
If you're reading the input text from a file, the Select-String cmdlet, whose very purpose is to find the whole lines on which a given regex or literal substring (-SimpleMatch) matches, additionally offers streaming processing, i.e. it reads lines one by one, without the need to read the whole file into memory.

(Select-String -LiteralPath file.txt -Pattern BOURKE).Line

Add -CaseSensitive for case-sensitive matching.

The following example simulates the above (-split '\r?\n' splits the multiline input string into individual lines, recognizing either newline format):

(
  @'
initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text
'@ -split '\r?\n' |
    Select-String -Pattern BOURKE
).Line

Output:

001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...

[1] Strictly speaking, the [^\r\n]* would also stop matching at a \r character in isolation (i.e., even if not directly followed by \n). If ruling out that case is important (which seems unlikely), use a (simplified version of) the regex suggested by Mathias R. Jessen in a comment on the question: .*BOURKE.*?(?=\r?\n)

Thank you @mklement0 for this. I have also noted your suggestion in https://stackoverflow.com/questions/50210739/how-to-use-a-variable-as-part-of-a-regular-expression-in-powershell to use single quotes. — Dave, Mar 10 '22 at 19:17
