PowerShell Regex Ignore up until character string match including string match

Question

I am trying to read a file and ignore everything up until a character match. Sometimes the character match will appear on the same line with the results I need, so I can't do a Select-Object -Skip x where x is the number of lines returned from a document.

I have tried to use the .Split('<pre>') method on the results, and that worked, but I can't select the index because what is returned is a multi-line string.

Below is the start of an example of the text being returned. It's an HTML response that I'm trying to read the data out of. I cannot use the Content as it's a ByteArray and has a space between every character. So I've concluded it's time to ask for help with [Regex] in PowerShell.

I was looking at this answer and thought I could use /.+?(?=abc)/ by means of replacing the search string like this:

(Get-Content $env:TEMP\test.txt) | ForEach-Object { 
    [Regex]::Match($_, "^.+(?=\<pre\>)").Value
}

That didn't work either. I'm OK with regex when looking for a match like {\d\d\d} to ensure it's 3 digits long, but I'm not sure how to use it in this instance.

This is the start of a file being returned. I need to ignore everything up to and including the characters <pre> and then anything after that to the end of the file is OK.

Example command and result being returned here:

PS> Get-Content $env:TEMP\test.txt

HTTP/1.1 200 OK
Content-Length: 3524
Date: Thu, 18 Jun 2020 15:00:05 GMT
Last-Modified: Fri, 19 Jun 2020 01:00:05 GMT
Server: TTWS/1.2 on Microsoft-HTTPAPI/2.0

<!doctype html><html><body>
    <p>Test TCP WebServer 1.2</p>
    <pre>

    Directory: C:\tmp

EDIT:

I have this now, which removes everything up to and including the first <pre> tag and also removes the closing </pre> tag, but won't remove anything AFTER the closing </pre> tag.

(Get-Content $env:TEMP\test.txt -Raw) -replace '(?s)^.*?<pre>' -replace '<\/pre>(.+?)'

Can that be expanded to include to the end of the file?

`(Get-Content $env:TEMP\test.txt -Raw) -replace '(?s)^.*?<pre>'` may work for you here. — AdminOfThings, Jun 18 '20 at 15:36
Try `'(?s)^.*?(?=<pre>)'`, which matches up until <pre>. If you want to ignore and match pre to pre, that is different. — , Jun 18 '20 at 15:53
Try https://stackoverflow.com/questions/7167279/regex-select-all-text-between-tags — wp78de, Jun 18 '20 at 16:42
@WiktorStribiżew Trying to get everything between the two `pre` tags in the HTML — Danijel-James W, Jun 19 '20 at 00:03
`Get-Content $input_path -Raw | Select-String -Pattern '(?s)(?<=<pre>\s*).*?(?=\s*</pre>)' -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file` — Wiktor Stribiżew, Jun 19 '20 at 00:11
I was looking at that solution using Select-String @WiktorStribiżew just as I saw the comment come through. Unfortunately it's not reading it properly. :( — Danijel-James W, Jun 19 '20 at 00:35
@AdminOfThings I have made an edit to my post. Are you able to show me how we remove from the `</pre>` tag to the end of the file also? — Danijel-James W, Jun 19 '20 at 00:46

score 1 · Accepted Answer · answered Jun 19 '20 at 09:30

The .+? pattern is "lazy" (non-greedy): it matches as few characters as it is allowed to. Since .+? sits at the end of your pattern and must match at least one character, it matches exactly one character and stops. You need a greedy quantifier there, such as .* or .+, together with (?s) so that . also matches newlines.
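To see the difference, here is a minimal sketch on a made-up one-line string (the `$text` value is my own example, not taken from your file). It only contrasts a lazy and a greedy quantifier at the end of a pattern:

```powershell
# Hypothetical sample text, not from the original file
$text = 'data</pre></body></html>'

# Lazy quantifier at the end: removes </pre> plus exactly one more character
$lazy = $text -replace '(?s)</pre>.+?'
$lazy      # → data/body></html>

# Greedy quantifier: removes </pre> and everything after it
$greedy = $text -replace '(?s)</pre>.*'
$greedy    # → data
```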

Besides, you can achieve what you need with a single -replace command if you use a capturing group.

You need to use

(Get-Content $env:TEMP\test.txt -Raw) -replace '(?s)^.*?<pre>(.*?)</pre>.*', '$1'

It will take the whole file content and get the text contents between the first <pre> string and the closest </pre>.
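As a rough illustration, the same replacement can be tried on an in-memory string instead of a file. The `$sample` here-string below is a shortened, assumed stand-in for your response (the closing tags are my guess at how the file ends):

```powershell
# Hypothetical sample modelled on the response shown in the question
$sample = @'
HTTP/1.1 200 OK
Content-Length: 3524

<!doctype html><html><body>
    <p>Test TCP WebServer 1.2</p>
    <pre>

    Directory: C:\tmp

</pre></body></html>
'@

# Keep only what sits between the first <pre> and the nearest </pre>
$inner = $sample -replace '(?s)^.*?<pre>(.*?)</pre>.*', '$1'

# Trim the blank lines and indentation left around the captured text
$inner.Trim()    # → Directory: C:\tmp
```

Note the single quotes around '$1': in a double-quoted string PowerShell would try to expand $1 as a variable before the regex engine sees it.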

Pattern details

(?s) - a RegexOptions.Singleline inline modifier making . match newlines, too
^ - start of string
.*? - any zero or more chars as few as possible
<pre> - a <pre> text
(.*?) - capturing group #1: any zero or more chars as few as possible
</pre> - a </pre> text
.* - any zero or more chars as many as possible (as * is a greedy quantifier).

The $1 in the replacement pattern inserts the Group 1 value into the result, so the captured text is all that remains.
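One thing to keep in mind: when the pattern does not match at all, -replace returns the input unchanged. If that matters for you, a possible alternative (not what the command above does) is [regex]::Match, which lets you check Success first. A small sketch, with `$noPre` as an invented input that has no <pre> block:

```powershell
# Hypothetical input with no <pre> block at all
$noPre = 'HTTP/1.1 404 Not Found'

# -replace leaves the input unchanged when the pattern does not match
$unchanged = $noPre -replace '(?s)^.*?<pre>(.*?)</pre>.*', '$1'

# [regex]::Match exposes Success, so the no-match case can be handled explicitly
$m = [regex]::Match($noPre, '(?s)<pre>(.*?)</pre>')
$result = if ($m.Success) { $m.Groups[1].Value } else { $null }
```

Here $unchanged is still the full header line, while $result is $null, which is easier to test for in a script.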

Thank you for providing the extra content and giving me a better oversight. I usually only use regex to match what I'm looking for in strings, but that's about it. This really helps a lot! Thank you! — Danijel-James W, Jun 19 '20 at 12:27
