
I want to extract values from an HTML document; in another program (UI.Vision / Selenium) I can do it with XPath statements. I have worked out a whole lot of working XPaths, and now I want to use them in PowerShell. I have the string $html containing everything from <html> to </html> (inclusive).

As far as I have researched, I need an XML object to use Select-Xml with XPath statements.

In order to convert $html to XML, I tried a cast:

[xml]$xml = $html

as well as

 $xml = [xml]$html

and I tried to convert:

$html = $html | ConvertTo-Xml

All failed. I think the HTML would need to be well-formed XML, but it is not (even though it's perfect HTML and passes the W3C validator without warnings). It's minified, and most attribute values lack quotation marks.

So how can I get XPath to work on a string containing an HTML page? I am about to resort to regular expressions, but translating all the XPath statements seems like a lot of work.
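For reference, a minimal repro of the failure described above (the HTML snippet is illustrative; the unquoted attribute value is what trips up the XML parser):

```powershell
# Minified HTML with an unquoted attribute value - valid HTML, but not well-formed XML.
$html = '<html><body><div id=main>hello</div></body></html>'

try {
    [xml]$xml = $html   # the cast invokes the XML parser ...
} catch {
    "Cast failed: $($_.Exception.Message)"   # ... which rejects id=main
}
```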

jamacoe

1 Answer


HTML documents (except the XHTML variant, which is rarely seen these days) are not valid XML and therefore cannot be parsed as such.

The HTML Agility Pack is a third-party HTML parsing library whose API is similar to the standard [xml] (System.Xml.XmlDocument) API and therefore includes XPath support via methods such as .SelectNodes(). A PowerShell wrapper for it exists in the form of the PowerHTML module; see this answer for an example of its use.
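A minimal sketch of that approach (assuming the PowerHTML module has been installed from the PowerShell Gallery; the sample HTML and XPath are illustrative):

```powershell
# Install once, for the current user:
#   Install-Module -Name PowerHTML -Scope CurrentUser
Import-Module PowerHTML

# Minified HTML with unquoted attribute values - not well-formed XML,
# but the HTML Agility Pack parses it without complaint.
$html = '<html><body><div id=main><p class=note>Hello</p></div></body></html>'

# ConvertFrom-Html parses the string into an HtmlAgilityPack node tree.
$doc = $html | ConvertFrom-Html

# XPath works as usual via .SelectSingleNode() / .SelectNodes().
$node = $doc.SelectSingleNode('//div[@id="main"]/p')
$node.InnerText   # -> Hello
```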

Caveats:

  • The PowerHTML module hasn't been updated in a while, and, as of this writing, the bundled library version is v1.7.0, whereas the current library version is v1.11.43.
  • You don't strictly need the wrapper module, but it makes use from PowerShell easier.
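If you do go wrapper-less, you can load the library directly; a sketch, assuming HtmlAgilityPack.dll has been obtained separately (e.g. extracted from the NuGet package) and the path is adjusted accordingly:

```powershell
# Load the HTML Agility Pack assembly directly (path is illustrative).
Add-Type -Path '.\HtmlAgilityPack.dll'

$html = '<html><body><a href=https://example.org>link</a></body></html>'

# LoadHtml() accepts the raw HTML string, however messy.
$htmlDoc = [HtmlAgilityPack.HtmlDocument]::new()
$htmlDoc.LoadHtml($html)

# XPath queries hang off .DocumentNode, analogous to [xml].
$links = $htmlDoc.DocumentNode.SelectNodes('//a[@href]')
foreach ($link in $links) { $link.GetAttributeValue('href', '') }
```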
mklement0