How to parse an HTML table with PowerShell Core 7?

Question

I have the following code:

    $html = New-Object -ComObject "HTMLFile"
    $source = Get-Content -Path $FilePath -Raw
    try
    {
        $html.IHTMLDocument2_write($source) 2> $null
    }
    catch
    {
        $encoded = [Text.Encoding]::Unicode.GetBytes($source)
        $html.write($encoded)
    }
    $t = $html.getElementsByTagName("table") | Where-Object {
        $cells = $_.tBodies[0].rows[0].cells
        $cells[0].innerText -eq "Name" -and
        $cells[1].innerText -eq "Description" -and
        $cells[2].innerText -eq "Default Value" -and
        $cells[3].innerText -eq "Release"
    }

The code works fine on Windows PowerShell 5.1, but on PowerShell Core 7, $_.tBodies[0].rows returns null.

So, how does one access the rows of an HTML table in PS 7?

See also: [Extracting HTML table as CSV](https://stackoverflow.com/a/67162906/1701026) — iRon, May 08 '21 at 15:26

mklement0 · Accepted Answer · 2023-05-24T18:18:51.710

PowerShell (Core), as of 7.3.4, does not come with a built-in HTML parser - and this may never change.

You must rely on a third-party solution, such as the PowerHTML module that wraps the HTML Agility Pack.

The object model works differently than the Internet Explorer-based one available in Windows PowerShell; it is similar to the XML DOM provided by the standard System.Xml.XmlDocument type ([xml]) [1]; see the documentation and the sample code below.

    # Install the module on demand
    if (-not (Get-Module -ErrorAction Ignore -ListAvailable PowerHTML)) {
      Write-Verbose "Installing PowerHTML module for the current user..."
      Install-Module PowerHTML -Scope CurrentUser -ErrorAction Stop
    }
    Import-Module -ErrorAction Stop PowerHTML

    # Create a sample HTML file with a table with 2 columns.
    Get-Item $HOME | Select-Object Name, Mode | ConvertTo-Html > sample.html

    # Parse the HTML file into an HTML DOM.
    $htmlDom = ConvertFrom-Html -Path sample.html

    # Find a specific table by its column names, using an XPath
    # query to iterate over all tables.
    $table = $htmlDom.SelectNodes('//table') | Where-Object {
      $headerRow = $_.Element('tr') # or: @($_.Elements('tr'))[0]
      # Filter by column names
      $headerRow.ChildNodes[0].InnerText -eq 'Name' -and
        $headerRow.ChildNodes[1].InnerText -eq 'Mode'
    }

    # Print the table's HTML text.
    $table.InnerHtml

    # Extract the first data row's first column value.
    # Note: @(...) is required around .Elements() for indexing to work.
    @($table.Elements('tr'))[1].ChildNodes[0].InnerText

A Windows-only alternative is to use the HTMLFile COM object, as shown in this answer, and as used in your own attempt - I'm unclear on why it didn't work in your specific case.

[1] Notably with respect to supporting XPath queries via the .SelectSingleNode() and .SelectNodes() methods, exposing child nodes via a .ChildNodes collection, and providing .InnerHtml / .OuterHtml / .InnerText properties (analogous to .InnerXml / .OuterXml / .InnerText in the XML DOM). Instead of an indexer that supports child element names, methods .Element(<name>) and .Elements(<name>) are provided.
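
To make the parallel in the footnote concrete, here is a minimal sketch that runs the same kind of XPath-based lookup against the built-in [xml] type instead of PowerHTML. It assumes the input is well-formed XHTML, which real-world HTML often is not (hence the need for a dedicated HTML parser); the sample markup and the variable names are made up for illustration.

```powershell
# Well-formed XHTML sample, so that the built-in [xml] parser accepts it.
# (Illustrative data only; not taken from any real page.)
[xml] $xmlDom = @'
<html><body>
<table>
<tr><th>Name</th><th>Mode</th></tr>
<tr><td>jdoe</td><td>d----</td></tr>
</table>
</body></html>
'@

# Find the table by its header cells, using XPath via .SelectNodes(),
# just as with the HTML Agility Pack DOM.
$xmlTable = $xmlDom.SelectNodes('//table') | Where-Object {
  $headerCells = $_.SelectNodes('tr[1]/th')
  $headerCells[0].InnerText -eq 'Name' -and $headerCells[1].InnerText -eq 'Mode'
}

# Extract the first data row's first column value.
$firstValue = $xmlTable.SelectSingleNode('tr[2]/td[1]').InnerText
$firstValue
```

Addressing cells with XPath (tr[2]/td[1]) rather than by .ChildNodes index avoids depending on whether whitespace-only text nodes are present between elements.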

score 0 · Answer 2 · edited Nov 10 '22 at 23:06

I based my solution on the accepted answer and installed PowerHTML. I wanted to extract the data table from https://www.dicomlibrary.com/dicom/dicom-tags/ and convert each of its rows.

From this:

    <tr><td>(0002,0000)</td><td>UL</td><td>File Meta Information Group Length</td><td></td></tr>

To this:

    {"00020000", "ULFile Meta Information Group Length"}

    $page = Invoke-WebRequest https://www.dicomlibrary.com/dicom/dicom-tags/
    $htmlDom = ConvertFrom-Html $page.Content
    $table = $htmlDom.SelectNodes('//table') | Where-Object {
      $headerRow = $_.Element('tr') # or: @($_.Elements('tr'))[0]
      # Filter by the first column name
      $headerRow.ChildNodes[0].InnerText -eq 'Tag'
    }

    # Collect one object per data row; rows with fewer than three <td> cells
    # (e.g. a <th>-based header row) are skipped.
    $data = foreach ($row in $table.SelectNodes('tr')) {
      $cells = $row.SelectNodes('td')
      if ($null -eq $cells -or $cells.Count -lt 3) { continue }
      # '(0002,0000)' -> '{"00020000",'
      $tag = $cells[0].InnerText.Trim() -replace '\s+', ' ' -replace '\(', '{"' -replace ',', '' -replace '\)', '",'
      $vr = $cells[1].InnerText.Trim() -replace '\s+', ''
      $name = $cells[2].InnerText.Trim() -replace '\s+', ' '
      [pscustomobject] @{ Tag = $tag; Value = '"' + $vr + $name + '"},' }
    }

    $data | Out-File c:\scripts\dd.txt
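
The string conversion itself can be tried without downloading the page. The following is a minimal sketch that turns the three cell texts of the sample row into the target line shown earlier; Format-DicomTagEntry and its parameter names are hypothetical helpers introduced here for illustration, not part of PowerHTML or of the code above.

```powershell
# Hypothetical helper: formats the cell texts of one table row as a
# single '{"<tag>", "<VR><name>"}' entry.
function Format-DicomTagEntry {
  param(
    [string] $Tag,   # e.g. '(0002,0000)'
    [string] $VR,    # e.g. 'UL'
    [string] $Name   # e.g. 'File Meta Information Group Length'
  )
  # Strip parentheses, commas and whitespace: '(0002,0000)' -> '00020000'
  $tagDigits = $Tag -replace '[(),\s]', ''
  # Concatenate VR and name, collapsing any whitespace runs to one space.
  $text = ($VR.Trim() + $Name.Trim()) -replace '\s+', ' '
  # '{{' and '}}' are escaped literal braces for the -f operator.
  '{{"{0}", "{1}"}}' -f $tagDigits, $text
}

$entry = Format-DicomTagEntry -Tag '(0002,0000)' -VR 'UL' -Name 'File Meta Information Group Length'
$entry
```

Emitting each entry as one finished string, rather than as an object with separate Tag and Value properties, means Out-File writes the lines as-is instead of rendering a formatted table.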
