1

I have the following html table:

<!DOCTYPE html>
<html>
<head>
<style>
table {
  font-family: arial, sans-serif;
  border-collapse: collapse;
  width: 100%;
}

td, th {
  border: 1px solid #dddddd;
  text-align: left;
  padding: 8px;
}

tr:nth-child(even) {
  background-color: #dddddd;
}
</style>
</head>
<body>

<h2>HTML Table</h2>

<table>
  <tr>
    <th>Company</th>
    <th>Contact</th>
    <th>Country</th>
  </tr>
  <tr>
    <td>Alfreds Futterkiste</td>
    <td>Maria Anders</td>
    <td>Germany</td>
  </tr>
  <tr>
    <td>Centro comercial Moctezuma</td>
    <td>Francisco Chang</td>
    <td>Mexico</td>
  </tr>
  <tr>
    <td>Ernst Handel</td>
    <td>Roland Mendel</td>
    <td>Austria</td>
  </tr>
  <tr>
    <td>Island Trading</td>
    <td>Helen Bennett</td>
    <td>UK</td>
  </tr>
  <tr>
    <td>Laughing Bacchus Winecellars</td>
    <td>Yoshi Tannamuri</td>
    <td>Canada</td>
  </tr>
  <tr>
    <td>Magazzini Alimentari Riuniti</td>
    <td>Giovanni Rovelli</td>
    <td>Italy</td>
  </tr>
</table>

</body>

I would like to match all occurrences of <th>table headers</th> and <td>table data</td>.

For the <td>table data</td> I have managed to invoke a webrequest, got the html file and am now in the process of extracting the table contents:

$Table = $Data.Content
$NumberOfColumns = ($Table | Select-String "<th>" -AllMatches).Matches.Count
$NumberOfRows = ($Table | Select-String "<td>" -AllMatches).Matches.Count

$AllMatches = @()
$Found = $Table -match "(?<=<td>)[a-zA-Z0-9 _-]{1,99}(?=</td>)"
ForEach ($Row in $NumberOfRows)
{
    If ($Found -eq $True)
    {
        $AllMatches += $Matches
    }
}
$AllMatches

I get this output:

Name                     Value
----                          -----
0                              Alfreds Futterkiste

I would like to get a list of all of the matches embedded in th and td (I am running Powershell Core 6.2, so the ParsedHtml method is not an option. I would like to parse the table manually).

Any suggestions are greatly appreciated.

  • 2
    [Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. https://gallery.technet.microsoft.com/Powershell-Tip-Parsing-49eb8810 – Toto Jan 04 '20 at 13:37
  • Does this work with PSCore 6.2? I don't believe it does as it relies on the IE Parser, or am I wrong? –  Jan 04 '20 at 13:54
  • The `HTMLFile` COM object should be available on any regular Windows system, but it won't give you the latest DOM API methods (things like `querySelector` will be unavailable). – Tomalak Jan 04 '20 at 14:01
  • 1
    Other than that, using the HTML Agility Pack (or another parser, although I believe this one is the most mature and versatile one for .NET) is the only sane option you have. – Tomalak Jan 04 '20 at 14:14
  • Thanks! I'll look into the HTML Agility Pack. The DOM-based parsers only work up to Windows Powershell 5.1, but are apparently won't run with PS Core –  Jan 04 '20 at 14:38
  • 1
    Use a competent HTML parser. https://stackoverflow.com/a/1732454/447901 – lit Jan 04 '20 at 22:00

2 Answers2

0

As stated, it's best to use a dedicated HTML parser to parse HTML text, because manual parsing based on regexes is fragile and has severe limitations.

However, given that PowerShell Core (v6+) has no built-in HTML parsing and given that your parsing requirements are simple, you can get away with regex-based parsing in this case:

$Table = $Data.Content

# Get all <th> values (column names) and count them.
$colNames = [regex]::Matches($Table, '(?<=<th>).+?(?=</th>)').Value
$colCount = $colNames.Count

# Create an ordered hashtable with the column names as keys 
# to serve as the template for the output objects.
$oht = [ordered] @{}
foreach ($col in $colNames) { $oht[$col] = $null }

# Get all <td> values (row values).
$rowValues = [regex]::Matches($Table, '(?<=<td>).+?(?=</td>)').Value

# Construct custom objects whose properties are named for the column names
# and whose values are the row values.
$i = 0
foreach ($val in $rowValues) {
  # Assing the row value to the column-appropriate hashtable entry.
  $oht[$i % $colCount] = $val
  if ($i % $colCount -eq ($colCount - 1)) {
    # The last property for the row at hand was filled,
    # construct and output a custom object from the hashtable.
    [pscustomobject] $oht
  }
  ++$i
}

The above yields an array of [pscustomobject] instances, which print to the display as follows (to capture the array in a variable, simply use $objs = foreach ($val in $rowValues) ...):

Company                      Contact          Country
-------                      -------          -------
Alfreds Futterkiste          Maria Anders     Germany
Centro comercial Moctezuma   Francisco Chang  Mexico
Ernst Handel                 Roland Mendel    Austria
Island Trading               Helen Bennett    UK
Laughing Bacchus Winecellars Yoshi Tannamuri  Canada
Magazzini Alimentari Riuniti Giovanni Rovelli Italy

If you're willing to install a third-party HTML parser on demand, here's a solution using the HTML Agility Pack NuGet package:

Note: On Windows, with version 1.4.6 of the PackageManagement module, installing the NuGet package can fail, complaining about a dependency loop - Unix platforms are not affected. If needed, manually download the NuGet package from https://www.nuget.org/packages/HtmlAgilityPack/.

# Install the HtmlAgilityPack NuGet package on demand.
if (-not (Get-Package -ea Ignore HtmlAgilityPack)) {
  # Make sure that NuGet is registered as a package source.
  if (-not (Get-PackageSource -ea Ignore nuget.org)) {
    $null = Register-PackageSource -ea Stop -ProviderName nuget -name nuget.org -Location https://www.nuget.org/api/v2 -Trusted
  }
  $null = Install-Package -ea Stop HtmlAgilityPack -Scope CurrentUser -Provider NuGet
}

# Load the HtmlAgilityPack assemblies into the current session.
Add-Type -ea Stop -Path ((Get-Package HtmlAgilityPack).Source + '/../lib/netstandard2.0/HtmlAgilityPack.dll')

$Table = $Data.Content

# Create an HTML DOM object and parse the HTML text into it.
$d = [HtmlAgilityPack.HtmlDocument]::new()
$d.LoadHtml($Table)

# Get all <th> values (column names) and count them.
$colNames = $d.DocumentNode.SelectNodes('html/body/table//th').InnerText
$colCount = $colNames.Count

# Create an ordered hashtable with the column names as keys 
# to serve as the template for the output objects.
$oht = [ordered] @{}
foreach ($col in $colNames) { $oht[$col] = $null }

# Get all <td> values (row values).
$rowValues = $d.DocumentNode.SelectNodes('html/body/table//td').InnerText

# Construct custom objects whose properties are named for the column names
# and whose values are the row values.
$i = 0
foreach ($val in $rowValues) {
  # Assing the row value to the column-appropriate hashtable entry.
  $oht[$i % $colCount] = $val
  if ($i % $colCount -eq ($colCount - 1)) {
    # The last property for the row at hand was filled,
    # construct and output a custom object from the hashtable.
    [pscustomobject] $oht
  }
  ++$i
}

The output is the same as above.

Note how .SelectNodes() is used with an XPath query to extract the nodes of interest.

mklement0
  • 382,024
  • 64
  • 607
  • 775
0

If your table is as close to XML as in the example, you might directly parse it to XML using the [System.Net.WebUtility]::HtmlDecode Method:

If ($Data.Content -Match '<table>[\s\S]*<\/table>') {
    [xml]$Xml = [System.Net.WebUtility]::HtmlDecode($Matches[0])
    $Header = $Null
    $Xml.DocumentElement.SelectNodes('//tr') | ForEach-Object {
        If ($Null -eq $Header) {
            $Header = $_.GetElementsByTagName('th').'#text'
        } Else {
            $i = 0; $Property = [Ordered]@{}
            $_.GetElementsByTagName('td').'#text'.ForEach{ $Property[$Header[$i++]] = $_ }
            [PSCustomObject]$Property
        }
    }
}

(Tested with PowerShell Core on Windows and Raspbian)

iRon
  • 20,463
  • 10
  • 53
  • 79