0

I'm having a little trouble constructing a PowerShell -replace regex that's not too greedy.

Looking to convert occurrences of this pattern: /sites/*/*/SitePages/*/*.aspx to: /sites/*/*/SitePages/*/*.html

But I'm having an issue when there are multiple values on one line to be replaced: -replace's greediness captures the whole line, so only the last occurrence is replaced.

sample input:

<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>

failing regex segment:

% { $_ -Replace '(sites.*SitePages.*)\.aspx' , '${1}.html' }
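A minimal reproduction of the problem, as a sketch: the two-link line below is a shortened, hypothetical stand-in for the full sample, and `$line` and `$greedy` are illustrative names only.

```powershell
# Shortened, hypothetical sample with two links on one line
$line = '<a href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a>'

# Each greedy .* runs as far right as it can, so the single match spans
# from the first "sites" to the LAST ".aspx" on the line.
$greedy = $line -replace '(sites.*SitePages.*)\.aspx', '${1}.html'

# Only Jenkins.aspx is rewritten; Home.aspx is swallowed by the capture group.
$greedy
```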

Suggestions?

(motivation: I am trying to convert the aspx page references to html as we've moved from hosting on SharePoint. Pages are all static, so no issues, other than converting the page extensions)

Olaf
Ian W

4 Answers

5

Just as you stated yourself, using a regular expression to peek and poke in a structured string might give unexpected and greedy results. As suggested before, it is generally a bad idea to attempt to parse HTML with regular expressions. Instead, use a dedicated HTML parser such as the HtmlDocument class (and the Uri class for URIs).

Example

$html = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'

function ParseHtml($String) {
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    } 
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    $Html
}

$Document = ParseHtml $Html
# You might also select your div from a presumably larger document:
# $div = $Document.getElementsByClassName('ms-wikicontent')
$Document.getElementsByTagName('a') |ForEach-Object {
    if (([Uri]$_.href).LocalPath -like '/sites/*/*/SitePages/*.aspx') {
        $_.href = [System.IO.Path]::ChangeExtension($_.href, 'html')
    }
}
$Document.body.innerHtml

result:

<DIV class="ms-wikicontent ms-rtestate-field" style="PADDING-RIGHT: 10px">
<DIV class=ExternalClass8E56354CC4314DBA861E187B689F3A2B>
<TABLE id=layoutsTable style="WIDTH: 100%">
<TBODY>
<TR style="VERTICAL-ALIGN: top">
<TD style="WIDTH: 100%">
<DIV class=ms-rte-layoutszone-outer style="WIDTH: 100%">
<DIV aria-haspopup=true role=textbox aria-multiline=true class=ms-rte-layoutszone-inner aria-autocomplete=both><A id=0::Home|Home class=ms-wikilink href="/sites/Team/Project/SitePages/Home.html">Home</A> - <A id=1::Jenkins|Jenkins class=ms-wikilink href="/sites/Team/Project/SitePages/Jenkins.html">Jenkins</A>
<H1 class=ms-rteElement-H1>Jenkins Integration with Deployment Tools</H1></DIV></DIV></TR></TBODY></DIV></DIV>
iRon
  • 1
    Thank you, I will investigate this approach. I recognize parsing HTML w/regex is bad, but we are in a pinch in that the legacy SharePoint server is being decomm'd within the month and SAs have already made the servers read-only. Our goal was to salvage the legacy content such that we can port to another platform. As no migration options were provided (they assumed key data was all "shared Docs", not wiki, so just copy), the simplest way we extracted the content was Powershell `Get-Content`. Will explore your suggestion to perhaps also extract only the useful content. +1 for now. – Ian W Jun 05 '22 at 22:26
2

Without lookarounds, you can use a capture group like in your question. But when matching, you should not cross a " character, because each URL sits between double quotes.

(/sites\b[^\"]*/SitePages/[^\"]+)\.aspx\b

Explanation

  • ( Capture group 1
    • /sites\b Match /sites followed by a word boundary
    • [^\"]*/SitePages/ Match zero or more chars other than " and then match /SitePages/
    • [^\"]+ Match 1+ chars other than "
  • ) Close group 1
  • \.aspx\b Match .aspx and a word boundary

See a regex demo.

$input = @"
<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>
"@

$input -replace '(/sites\b[^\"]*/SitePages/[^\"]+)\.aspx\b' ,'$1.html'

Output

<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.html">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.html">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>

Another variation: if there are always exactly 2 path segments between /sites and /SitePages, you can match them with the exact quantifier {2} and, for example, assert the double quote after .aspx with a lookahead:

(/sites(?:/[^/\"]+){2}/SitePages/[^/\"]+)\.aspx(?=\")

See another regex demo.
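A short sketch of how this second pattern might be applied from PowerShell. The `$sample` string is a made-up input (not the question's full line) chosen to include one matching link, one link outside /sites, and a bare mention of Home.aspx in the text; the variable names are illustrative.

```powershell
# Hypothetical input: one SitePages link, one unrelated link, one plain-text mention
$sample = '<a href="/sites/Team/Project/SitePages/Home.aspx">Home</a> <a href="/other/page.aspx">Other</a> see Home.aspx'

# Exactly two segments after /sites, then /SitePages/<page>.aspx directly before a closing quote
$pattern = '(/sites(?:/[^/\"]+){2}/SitePages/[^/\"]+)\.aspx(?=\")'

# $1 re-inserts the captured path; only the extension changes
$result = $sample -replace $pattern, '$1.html'
$result
```

Only the first href should change here; the /other/page.aspx link and the trailing text mention are left alone because they do not fit the path shape or are not followed by a double quote.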

The fourth bird
  • 2
    General advice concerning PowerShell: do not use the variable name [`$input`](https://learn.microsoft.com/powershell/module/microsoft.powershell.core/about/about_automatic_variables#input) as it is a predefined [automatic variable](https://learn.microsoft.com/powershell/module/microsoft.powershell.core/about/about_automatic_variables). – iRon Jun 05 '22 at 13:14
  • 1
    @iRon Thank you, I was not aware of that. – The fourth bird Jun 05 '22 at 13:24
1

Daniel already showed an excellent solution using character exclusion [^/]:

$_ -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'

Alternatively, you could use the lazy modifier ? (as in .*?):

$_ -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html'

While the latter looks cleaner, it is less performant, because it requires more backtracking.
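One caveat that may be worth checking before choosing between the two: they are not strictly equivalent, because a lazy .*? can still expand across a / when that is the only way to reach a match, while [^/]* cannot. The sketch below uses a hypothetical path with one extra folder level (`Extra`) to show the difference; `$deep`, `$exclude` and `$lazy` are illustrative names.

```powershell
# Hypothetical link with THREE segments between /sites/ and /SitePages/
$deep = 'href="/sites/Team/Project/Extra/SitePages/Home.aspx"'

# Character exclusion: each [^/]* is confined to one path segment, so this does not match
$exclude = $deep -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'

# Lazy: the second .*? grows to "Project/Extra", so this still matches
$lazy = $deep -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html'

$exclude   # unchanged
$lazy      # extension rewritten
```

If every link is known to have exactly two folder levels, as in the sample input, both patterns give the same result and the choice comes down to readability versus speed.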

I did a little benchmark:

$text = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'

$runs = 100000
$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html' }}).TotalMilliseconds
$lazyMillis    = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html' }}).TotalMilliseconds

[PSCustomObject]@{
    RegExExclude = '{0} ms'        -f [int]$excludeMillis
    RegExLazy    = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}

Output from PS 7.2:

RegExExclude RegExLazy    
------------ ---------
281 ms       350 ms (125%)

The difference is noticeable, but not that big, so you may go for readability if performance doesn't matter.


The performance difference between the two becomes even smaller when using a compiled RegEx:

$text = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'

$runs = 100000

$rxExclude = [regex]::new( '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )
$rxLazy    = [regex]::new( '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )

$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $rxExclude.Replace( $text, 'html' ) }}).TotalMilliseconds
$lazyMillis    = (Measure-Command { foreach( $i in 1..$runs ) { $rxLazy.Replace( $text, 'html' ) }}).TotalMilliseconds

[PSCustomObject]@{
    RegExExclude = '{0} ms'        -f [int]$excludeMillis
    RegExLazy    = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}

Output from PS 7.2:

RegExExclude RegExLazy
------------ ---------
160 ms       178 ms (111%)
zett42
-1

Try:

[string]$string = "<div class='ms-wikicontent ms-rtestate-field' style='padding-right: 10px'><div class='ExternalClass8E56354CC4314DBA861E187B689F3A2B'><table id='layoutsTable' style='width:100%'><tbody><tr style='vertical-align:top'><td style='width:100%'><div class='ms-rte-layoutszone-outer' style='width:100%'><div class='ms-rte-layoutszone-inner' role='textbox' aria-haspopup='true' aria-autocomplete='both' aria-multiline='true'><a id='0::Home|Home' class='ms-wikilink' href='/sites/Team/Project/SitePages/Home.aspx'>Home</a> - <a id='1::Jenkins|Jenkins' class='ms-wikilink' href='/sites/Team/Project/SitePages/Jenkins.aspx'>Jenkins</a><h1 class='ms-rteElement-H1'>Jenkins Integration with Deployment Tools</h1>"

$string.Replace('.aspx','.html')

Or, if you are looking to build a regex, check out https://rubular.com/; it helps with building regular expressions.

  • This would work if I went the trivial route and replaced ALL occurrences of .aspx, but that would also replace references in the HTML body. I'm only interested in replacing those that are references to actual `.aspx` pages, all of which are stored in a path with `/sitePages/` in the reference; see the pattern in the Q. – Ian W Jun 05 '22 at 01:22