Regex: Starting from a specific point on each line

Question

I have an HTML file that displays software installed on a machine, and I'd like to remove some of the cells in the table in the HTML file. Below is a sample of the code:

<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>

...and so on.

What I'm trying to accomplish is to delete everything starting from the 4th instance of the td tag and stop just before the closing /tr tag on each line, so essentially eliminating...

<td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td>
<td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td>

...so that I'm left with...

<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td></tr>

The regex that I'm using is

(?<=<td>)(.*)(?=<\/tr>)

The issue I'm having is that the above regex is selecting the entire line of code. How can I change this so that it starts from the 4th instance of the td tag on each line?

Please see the following link with a full example of the HTML file I'm using and the regex applied: https://regex101.com/r/C9lkMc/3

EDIT 1: This HTML is generated from a PowerShell script to fetch installed software on remote machines. The code for that is:

    Invoke-Command -ComputerName $hostname -ScriptBlock {
    if (!([Diagnostics.Process]::GetCurrentProcess().Path -match '\\syswow64\\')) {

        $unistallPath = "\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\"
        $unistallWow6432Path = "\SOFTWARE\Wow6432Node\Microsoft\Windows\CurrentVersion\Uninstall\"
        @(
            if (Test-Path "HKLM:$unistallWow6432Path" ) { Get-ChildItem "HKLM:$unistallWow6432Path"}
            if (Test-Path "HKLM:$unistallPath" ) { Get-ChildItem "HKLM:$unistallPath" }
            if (Test-Path "HKCU:$unistallWow6432Path") { Get-ChildItem "HKCU:$unistallWow6432Path"}
            if (Test-Path "HKCU:$unistallPath" ) { Get-ChildItem "HKCU:$unistallPath" }
        ) |
            ForEach-Object { Get-ItemProperty $_.PSPath } |
            Where-Object {
            $_.DisplayName -and !$_.SystemComponent -and !$_.ReleaseType -and !$_.ParentKeyName -and ($_.UninstallString -or $_.NoRemove)
        } |
            Sort-Object DisplayName | Select-Object -Property DisplayName, DisplayVersion, InstallDate | ft
    }
}

[H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) - use a parser — ctwheels, Jan 25 '18 at 15:33
[**TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ**](https://stackoverflow.com/a/1732454/1954610) ... Don't use regex, if you can help it. What language are you using? Can you use an HTML parser? — Tom Lord, Jan 25 '18 at 15:33
I'm using a PowerShell script to generate the HTML. The part of the code that handles this is in the updated question. — obs0lete, Jan 25 '18 at 15:40

score 1 · Accepted Answer · answered Jan 25 '18 at 16:53

Regex isn't a great fit for parsing HTML, because there are a lot of odd cases: what happens if a row contains <td /> or <td colspan="2"> where you expected a plain <td>? Equally, HTML (annoyingly) doesn't always follow XML rules, so an XML parser won't reliably work either (e.g. <hr> has no end tag, which is fine in HTML but is not well-formed XML).
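To make that brittleness concrete, here is a minimal sketch of what a regex-only fix for the question could look like in PowerShell. It is not the approach recommended in this answer, and it assumes every row sits on a single line and every cell is a bare <td> with no attributes or nested tags. The variable names ($rows, $pattern, $trimmed) and the third sample row are made up for illustration.

```powershell
# Sample rows: the first two come from the question; the third (hypothetical)
# has an attribute on its first <td>.
$rows = @(
    '<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>'
    '<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>'
    '<tr><td colspan="2">Odd row</td><td>1.0</td><td></td><td>kratos.kcprod1.com</td><td>True</td></tr>'
)

# Capture "<tr>" plus the first three plain cells, then match everything up to
# (but not including) the closing </tr>, and keep only the captured part.
$pattern = '^(<tr>(?:<td>[^<]*</td>){3}).*(?=</tr>)'
$trimmed = $rows -replace $pattern, '$1'

$trimmed
```

The first two rows come back with three cells each. The third is returned untouched, without any warning, because its first cell is not a literal <td>. That silent pass-through is the kind of case a parser handles and a pattern like this does not.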

So if you're parsing HTML, you ideally want an HTML parser. PowerShell has access to one through the HTMLFile COM object, documented here: https://msdn.microsoft.com/en-us/library/aa752574(v=vs.85).aspx

Here are some examples...

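This code finds all TR elements, strips all TDs after the first 4 from each one, and returns the row's outer HTML. (To keep only the three cells shown in the question's desired output, change `-skip 4` to `-skip 3`.)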

$html = @'
some sort of html code
<hr> an unclosed tag so it's messy like html / unlike xml
<table>
<tr><th>Program Name</th><th>version</th><th>install date</th><th>computer name</th><th>ID</th><th>Installed</th></tr>
<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td /><td>123</td><td></td><td>hello.com</td><td>456</td><td>True</td></tr>
</table>
etc...
'@

$Parser = New-Object -ComObject 'HTMLFile' #see https://msdn.microsoft.com/en-us/library/aa752574(v=vs.85).aspx
$Parser.IHTMLDocument2_write($html) #if this method is unavailable in your environment, a commonly cited fallback is: $Parser.write([System.Text.Encoding]::Unicode.GetBytes($html))

$parser.documentElement.getElementsByTagName('tr') | %{
    $tr = $_
    $tr.getElementsByTagName('td') | select-object -skip 4 | %{$tr.removeChild($_)} | out-null
    $tr.OuterHtml
}

This works in a similar way, but instead of editing the HTML it pulls back the text of the first 4 cells in each row as an object:

$html = @'
some sort of html code
<hr> an unclosed tag so it's messy like html / unlike xml
<table>
<tr><th>Program Name</th><th>version</th><th>install date</th><th>computer name</th><th>ID</th><th>Installed</th></tr>
<tr><td>Adobe Acrobat Reader DC</td><td>18.009.20050</td><td>20171130</td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td>Adobe Flash Player 28 ActiveX</td><td>28.0.0.137</td><td></td><td>kratos.kcprod1.com</td><td>4104917a-93f2-46e5-941a-c4efd54504b7</td><td>True</td></tr>
<tr><td /><td>123</td><td></td><td>hello.com</td><td>456</td><td>True</td></tr>
</table>
etc...
'@

$Parser = New-Object -ComObject 'HTMLFile' #see https://msdn.microsoft.com/en-us/library/aa752574(v=vs.85).aspx
$Parser.IHTMLDocument2_write($html) #if this method is unavailable in your environment, a commonly cited fallback is: $Parser.write([System.Text.Encoding]::Unicode.GetBytes($html))

$parser.documentElement.getElementsByTagName('tr') | %{
    $tr = $_
    $a,$b,$c,$d = $tr.getElementsByTagName('td') | select-object -first 4 | %{"$($_.innerText)"} #we do this instead of `select -expand innerText` to ensure nulls are returned as blanks; not ignored
    (New-Object -TypeName 'PSObject' -Property ([ordered]@{
        AppName = $a
        Version = $b
        InstallDate = $c
        ComputerName = $d
    }))
}
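If the end goal is a new HTML table rather than objects, one option is to hand objects like these to ConvertTo-Html and list only the columns you want. This is a rough sketch rather than part of the tested code above: $apps is stand-in data I made up so the snippet runs on its own, and I keep three columns to match the output asked for in the question.

```powershell
# Stand-in data (hypothetical): same property names as the objects built in the example above.
$apps = @(
    [pscustomobject]@{ AppName = 'Adobe Acrobat Reader DC'; Version = '18.009.20050'; InstallDate = '20171130'; ComputerName = 'kratos.kcprod1.com' }
    [pscustomobject]@{ AppName = 'Adobe Flash Player 28 ActiveX'; Version = '28.0.0.137'; InstallDate = ''; ComputerName = 'kratos.kcprod1.com' }
)

# -Property limits the table to the listed columns;
# -Fragment emits only the <table> markup instead of a full HTML page.
$fragment = $apps | ConvertTo-Html -Fragment -Property AppName, Version, InstallDate

$fragment
```

Because the unwanted columns are never written in the first place, there is nothing left to strip out of the HTML afterwards.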

That is fantastic! Thank you. — obs0lete, Jan 25 '18 at 18:11
