
My goal is to parse an HTML file retrieved with Invoke-WebRequest. If possible, I'd like to avoid any external libraries.

The problem I am facing is that, since PowerShell 6, Invoke-WebRequest returns a BasicHtmlWebResponseObject instead of an HtmlWebResponseObject. The Basic version lacks the ParsedHtml property. Is there a good alternative for parsing HTML in PowerShell Core 6?

I've tried Select-Xml, but my HTML is not entirely valid (e.g. a missing closing tag), so it fails to parse the result.

Another alternative I've found is New-Object -ComObject "HTMLFile", but from my understanding this relies on Internet Explorer for parsing, which I'd like to avoid.

There is a very similar question here, but sadly it has had no answer or activity in 8 months.

Jannik
  • This one likely isn't going to get any activity either. That parsing functionality relies on IE components, an obvious non-starter for Core. The chance of someone reimplementing that stuff from scratch and putting it in the base implementation instead of requiring you to pull in an external lib for parsing is, well, small. There's an [issue](https://github.com/PowerShell/PowerShell/issues/2867) for it marked "up for grabs". It's close to its three year anniversary -- I don't see it getting grabbed any time soon, but who knows... – Jeroen Mostert Nov 01 '19 at 14:50
  • Thanks for this information. The thing that bothers me a bit is that the `Select-Xml` command has the parsing capabilities but lacks a not-so-strict parsing mode. That's why I thought that there might be an alternative. – Jannik Nov 01 '19 at 15:00
  • The thing is that the "not-so-strict parsing" that HTML requires is quite complicated, compared to the "carved in stone" XML standard, which is why implementing HTML parsing is really something best left to well curated libraries that take the time and effort to do this nontrivial thing that's subject to multiple interpretations. (HTML5 makes things a bit simpler by laying down a lot of the rules in a way we can all get behind.) PowerShell Classic pulled this in "for free" with the already written IE components that are part of the OS, but for Core the free lunch is over. – Jeroen Mostert Nov 01 '19 at 15:03
  • That sounds plausible, thanks. Then I'll probably have to look into some library. – Jannik Nov 01 '19 at 15:08

1 Answer


As mentioned in the comments, it is not really possible without a library. One very good library you could use is AngleSharp for .NET. It has great HTML parsing capabilities, and .NET code interacts well with PowerShell; have a look at this link.

Here is an example from their website:

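```csharp
using System.Linq;
using AngleSharp;

// Set up a browsing context that is allowed to load documents over HTTP.
var config = Configuration.Default.WithDefaultLoader();
var address = "https://en.wikipedia.org/wiki/List_of_The_Big_Bang_Theory_episodes";
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(address);
// Select the third cell of every episode row and read its text.
var cellSelector = "tr.vevent td:nth-child(3)";
var cells = document.QuerySelectorAll(cellSelector);
var titles = cells.Select(m => m.TextContent);
```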
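
The same idea can be tried directly from PowerShell Core. The following is only a minimal sketch, under these assumptions: the AngleSharp assembly has already been downloaded (for example by extracting the NuGet package), the path `./lib/AngleSharp.dll` is hypothetical, and the installed version is 0.10 or later, where `HtmlParser.ParseDocument` is available. The sample markup is made up for illustration and is deliberately missing closing tags.

```powershell
# Assumption: AngleSharp.dll was obtained beforehand; this path is hypothetical.
Add-Type -Path './lib/AngleSharp.dll'

# Deliberately malformed HTML: the second <td> and the <tr> are never closed.
$html = '<table><tr class="vevent"><td>1</td><td>Pilot</table>'
# For a live page you could use instead: $html = (Invoke-WebRequest -Uri $uri).Content

# Parse the string with AngleSharp's HTML5 parser, which tolerates invalid markup.
$parser = [AngleSharp.Html.Parser.HtmlParser]::new()
$document = $parser.ParseDocument($html)

# Query with a CSS selector and collect the text of each matching cell.
$titles = $document.QuerySelectorAll('tr.vevent td:nth-child(2)') |
    ForEach-Object { $_.TextContent }
$titles
```

Because the parser follows the HTML5 error-recovery rules, the missing closing tags that make Select-Xml fail should not be a problem here. Parsing the `Content` string returned by Invoke-WebRequest, rather than letting AngleSharp fetch the page itself, keeps the download step in PowerShell.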
Karlheinz