
I want to extract values from an HTML document; in another program (UI.Vision / Selenium) I can do it with XPath statements. I have worked out a whole lot of working XPaths, and now I want to use them in PowerShell. I have the string $html containing everything from <html> to </html> (inclusive).

As far as I have researched, I need an XML object to use Select-Xml with XPath statements.

In order to convert $html to XML, I tried a cast:

[xml]$xml = $html

as well as

 $xml = [xml]$html

and I tried to convert:

$html = $html | ConvertTo-Xml

All failed. I think the HTML would need to be well-formed XML, but it is not (even though it's perfect HTML and passes the W3C validator without warnings). It's minified, and most attribute values lack quotation marks.

So how can I get XPath to work on a string containing an HTML page? I am about to resort to regular expressions, but translating all the XPath statements seems like a lot of work.
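For reference, a minimal repro of the failure described above (the HTML snippet is illustrative; the unquoted attribute value is what trips up the XML parser):

```powershell
# Minified HTML with an unquoted attribute value - valid HTML, but not well-formed XML.
$html = '<html><body><div id=main>hello</div></body></html>'

try {
    [xml]$xml = $html   # the cast invokes the XML parser ...
} catch {
    "Cast failed: $($_.Exception.Message)"   # ... which rejects id=main
}
```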

jamacoe

1 Answer


HTML documents (except the XHTML variant, which is rarely seen these days) are not valid XML and therefore cannot be parsed as such.

The HTML Agility Pack is a third-party HTML parsing library whose API is similar to the standard [xml] (System.Xml.XmlDocument) API and therefore includes XPath support via methods such as .SelectNodes(). A PowerShell wrapper for it exists in the form of the PowerHTML module; see this answer for an example of its use.
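A minimal sketch of that approach (assuming the PowerHTML module has been installed from the PowerShell Gallery; the sample HTML and XPath are illustrative):

```powershell
# Install once, for the current user:
#   Install-Module -Name PowerHTML -Scope CurrentUser
Import-Module PowerHTML

# Minified HTML with unquoted attribute values - not well-formed XML,
# but the HTML Agility Pack parses it without complaint.
$html = '<html><body><div id=main><p class=note>Hello</p></div></body></html>'

# ConvertFrom-Html parses the string into an HtmlAgilityPack node tree.
$doc = $html | ConvertFrom-Html

# XPath works as usual via .SelectSingleNode() / .SelectNodes().
$node = $doc.SelectSingleNode('//div[@id="main"]/p')
$node.InnerText   # -> Hello
```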

Caveats:

  • The PowerHTML module hasn't been updated in a while, and, as of this writing, the bundled library version is v1.7.0, whereas the current library version is v1.11.43.
  • You don't strictly need the wrapper module, but it makes use from PowerShell easier.
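If you do go wrapper-less, you can load the library directly; a sketch, assuming HtmlAgilityPack.dll has been obtained separately (e.g. extracted from the NuGet package) and the path is adjusted accordingly:

```powershell
# Load the HTML Agility Pack assembly directly (path is illustrative).
Add-Type -Path '.\HtmlAgilityPack.dll'

$html = '<html><body><a href=https://example.org>link</a></body></html>'

# LoadHtml() accepts the raw HTML string, however messy.
$htmlDoc = [HtmlAgilityPack.HtmlDocument]::new()
$htmlDoc.LoadHtml($html)

# XPath queries hang off .DocumentNode, analogous to [xml].
$links = $htmlDoc.DocumentNode.SelectNodes('//a[@href]')
foreach ($link in $links) { $link.GetAttributeValue('href', '') }
```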
mklement0