
I'm trying to get some text from a website, but as soon as I access ParsedHtml on the returned object, PowerShell stops responding (even after I leave it running in the background for a few minutes, it does nothing). What could be causing this?

PS P:\> $url = "mywebsite"
PS P:\> $result = invoke-WebRequest $url
PS P:\> $result | Get-Member

TypeName: Microsoft.PowerShell.Commands.HtmlWebResponseObject

Name              MemberType Definition
----              ---------- ----------
Dispose           Method     void Dispose(), void IDisposable.Dispose()
Equals            Method     bool Equals(System.Object obj)
GetHashCode       Method     int GetHashCode()
GetType           Method     type GetType()
ToString          Method     string ToString()
AllElements       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection AllElements {get;}
BaseResponse      Property   System.Net.WebResponse BaseResponse {get;set;}
Content           Property   string Content {get;}
Forms             Property   Microsoft.PowerShell.Commands.FormObjectCollection Forms {get;}
Headers           Property   System.Collections.Generic.Dictionary[string,string] Headers {get;}
Images            Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Images {get;}
InputFields       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection InputFields {get;}
Links             Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Links {get;}
ParsedHtml        Property   mshtml.IHTMLDocument2 ParsedHtml {get;}
RawContent        Property   string RawContent {get;set;}
RawContentLength  Property   long RawContentLength {get;}
RawContentStream  Property   System.IO.MemoryStream RawContentStream {get;}
Scripts           Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Scripts {get;}
StatusCode        Property   int StatusCode {get;}
StatusDescription Property   string StatusDescription {get;}

PS P:\> $result.ParsedHtml | Get-Member

The program then freezes after that last command. A popup asks whether the site is allowed to save cookies on my PC, but clicking neither Yes nor No helps.

$result.RawContent

for example, works just fine and prints all of the HTML text, but it has no getElementsBy... methods, which I guess live on ParsedHtml, hence why I need it. It works on YouTube, for example, but it freezes on the specific site I want to check. Any help is greatly appreciated!

btec
  • Try adding the `-UseBasicParsing` switch on your `Invoke-WebRequest` call. – gvee May 15 '18 at 07:29
  • Just tried it: ParsedHtml no longer exists on $result. Accessing $result.ParsedHtml gives a missing-object error, and the property no longer appears in the output of $result | Get-Member. – btec May 15 '18 at 07:58
  • I believe this is caused by the security settings in Internet Explorer. Windows PowerShell uses IE to parse the HTML, and with it IE's security settings. – Zerqent May 15 '18 at 10:46
  • I'm trying to work around it now. Is there any other way to extract specific words from a table on the website? – btec May 15 '18 at 11:11
  • @btc ever found a solution to this? – David Trevor Jun 28 '19 at 15:07

1 Answer


From the Invoke-WebRequest reference page on learn.microsoft.com, describing the -UseBasicParsing parameter:

This parameter has been deprecated. Beginning with PowerShell 6.0.0, all Web requests use basic parsing only. This parameter is included for backwards compatibility only and any use of it has no effect on the operation of the cmdlet.

A more detailed explanation comes from an MS staff comment on the PowerShell GitHub repository, Issue #2867:

Windows PowerShell relied on Internet Explorer to parse the html. Since Internet Explorer wasn't available in most platforms we support with PowerShell Core 6 (nanoserver, Linux, macOS), it made sense to default to -UseBasicParsing. @MSAdministrator's proposal for ConvertFrom-Html is a better solution rather than marrying the parsing capability to the web cmdlets (like parsing a local html file).

and later:

Seems that the community has helped fill in this gap with modules on PowerShellGallery to specifically handle parsing html.

As of today there still does not seem to be a built-in ConvertFrom-Html, so I guess your choices are PowerShell Gallery modules that provide parsing, or the limited alternative below. The modules won't give you the ParsedHtml property per se, but they do give you traversable, structured content that might serve your purposes:

https://stackoverflow.com/a/53878303/537243
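
If installing a module is not an option, one dependency-free fallback for the "specific words from a table" case is a regular expression over the raw markup. This is only a rough sketch: the sample HTML and the variable names are made up for illustration, and in practice `$html` would come from something like `(Invoke-WebRequest $url -UseBasicParsing).Content`. Regex matching is brittle and will break on nested tables or unusual markup.

```powershell
# Hypothetical sample markup standing in for the .Content of a real response
$html = @'
<table>
  <tr><th>Name</th><th>Status</th></tr>
  <tr><td>Server A</td><td>Online</td></tr>
  <tr><td>Server B</td><td>Offline</td></tr>
</table>
'@

# (?is) = case-insensitive, and '.' also matches newlines
$rowPattern  = '(?is)<tr[^>]*>(.*?)</tr>'
$cellPattern = '(?is)<t[dh][^>]*>(.*?)</t[dh]>'

# Build an array of rows, where each row is an array of cell strings
$rows = @(foreach ($row in [regex]::Matches($html, $rowPattern)) {
    $cells = foreach ($cell in [regex]::Matches($row.Groups[1].Value, $cellPattern)) {
        # Drop any tags nested inside the cell, decode entities, trim whitespace
        $text = $cell.Groups[1].Value -replace '<[^>]+>', ''
        [System.Net.WebUtility]::HtmlDecode($text).Trim()
    }
    # The leading comma keeps each row as one array instead of flattening it
    ,@($cells)
})

# Print one line per row
$rows | ForEach-Object { $_ -join ' | ' }
```

Individual cells can then be read by index, for example `$rows[1][1]` for the second cell of the second row.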

In very, very limited circumstances you could try to exploit the idea that "html is a subtype of xml". This only holds for well-formed, XHTML-style markup: an XML parser will fail on many of the syntax "deviations" that HTML permits (unclosed tags, unquoted attributes, and so on), so the source has to be very regular and very vanilla:

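# Only works if the page is well-formed XML; the [xml] cast throws otherwise
$webresponse = Invoke-WebRequest -Uri "https://w3.org"
$xmldoc = [xml]$webresponse.Content
# This element path is specific to that page's markup and may have changed
$xmldoc.html.body.div[0].div.h1.span | Select-Object '#text'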
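
To see how narrow this approach is without depending on a live site, here is a small offline sketch. Both sample strings are invented for illustration: the first is well-formed and casts cleanly, while the second uses ordinary HTML habits (an unclosed `<p>` and a bare `<br>`) that make the `[xml]` cast throw.

```powershell
# Hypothetical sample strings, not taken from any real site
$wellFormed = '<html><body><div><h1><span>Hello</span></h1></div></body></html>'
$sloppy     = '<html><body><p>Unclosed paragraph<br></body></html>'

$results = foreach ($sample in $wellFormed, $sloppy) {
    try {
        # The cast throws a terminating error if the text is not well-formed XML
        $doc = [xml]$sample
        "Parsed OK: $($doc.html.body.div.h1.span)"
    }
    catch {
        "Not valid XML: $($_.Exception.Message)"
    }
}

$results
```

Wrapping the cast in try/catch like this lets a script fall back to another parsing method when the page turns out not to be well-formed.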
Justin