I am trying to retrieve some information from a website. I want to look for a specific tag/class and then return the contained text value (innerHTML). This is what I have so far:

$request = Invoke-WebRequest -Uri $url -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"
$src = $request.RawContent
$HTML.write($src)


foreach ($obj in $HTML.all) { 
    $obj.getElementsByClassName('some-class-name') 
}

I think there is a problem with converting the HTML into the HTML object, since I see a lot of undefined properties and empty results when I'm trying to "Select-Object" them.

After spending two days on this: since parsing HTML with regex is such a big no-no, how am I supposed to parse HTML with PowerShell? Nothing seems to work.

David Trevor

2 Answers

Since no one else has posted an answer: I managed to get a working solution with the following code:

$request = Invoke-WebRequest -Uri $URL -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"
[string]$htmlBody = $request.Content
$HTML.write([ref]$htmlBody)
$filter = $HTML.getElementsByClassName($htmlClassName)

For some URLs the $filter variable came back empty, while for others it was populated. All in all, this might work for your situation, but PowerShell doesn't seem to be the way to go for more complex parsing.
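
If `.write()` throws on your machine, a commonly used fallback is to try `IHTMLDocument2_write` first and otherwise pass the markup as UTF-16 bytes. The following is only a sketch of that idea, not a definitive implementation: `ConvertTo-HtmlDocument` is a hypothetical helper name, the "HTMLFile" COM class exists on Windows only, and which of the two branches succeeds reportedly depends on what is installed on the machine.

```powershell
# Hypothetical helper (not a built-in cmdlet): loads an HTML string into an
# "HTMLFile" COM document. Windows only; the COM class does not exist elsewhere.
function ConvertTo-HtmlDocument {
    param(
        [Parameter(Mandatory)]
        [string]$Html
    )

    $doc = New-Object -ComObject 'HTMLFile'
    try {
        # Available on some machines only (reportedly those with Office installed).
        $doc.IHTMLDocument2_write($Html)
    }
    catch {
        # Fallback: hand the markup to the plain write() method as UTF-16 bytes.
        $doc.write([System.Text.Encoding]::Unicode.GetBytes($Html))
    }

    # The leading comma keeps PowerShell from unrolling the object on output.
    , $doc
}

# Usage with an inline sample instead of a live request (assumed markup):
# $doc = ConvertTo-HtmlDocument -Html '<div class="banana">hello</div>'
# $doc.getElementsByClassName('banana') | ForEach-Object { $_.innerHTML }
```

Testing against a small inline string like this first may help tell a parsing problem apart from a page that simply does not contain the class in its static HTML.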

David Trevor
  • 3
    I would point out that this solution works only on PowerShell deployed on Windows. The COM objects are not available in PowerShell v7.x.x generally. – KUTlime Jun 01 '21 at 18:56
  • Use [this answer](https://stackoverflow.com/a/24989452/11942268), if `.write()` throws an error. – stackprotector Sep 20 '21 at 19:11

In 2020, with Windows PowerShell 5.x (the ParsedHtml property is not available in PowerShell 6 and later), you do it like this:

$searchClass = "banana" <# in this example we parse all elements of class "banana" but you can use any class name you wish #>
$myURI = "url.com" <# replace url.com with any website you want to scrape from #>

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12 <# using TLS 1.2 is vitally important #>
$req = Invoke-WebRequest -Uri $myURI
$req.ParsedHtml.getElementsByClassName($searchClass) | ForEach-Object { Write-Host $_.innerHTML }

# for extra credit we can parse all the links
$req.ParsedHtml.getElementsByTagName('a') | ForEach-Object { Write-Host $_.href } # outputs all the links
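
For the link-listing part, the response object also exposes a `Links` collection that does not depend on ParsedHtml, so it is usable where ParsedHtml is missing. The sketch below wraps that in two small helpers; `Get-LinkHref` and `Test-HasParsedHtml` are hypothetical names introduced here for illustration, and the usage lines assume the `$myURI` and `$searchClass` variables from the snippet above.

```powershell
# Hypothetical helper (not a built-in cmdlet): returns the href of every link
# found on a response object produced by Invoke-WebRequest.
function Get-LinkHref {
    param(
        [Parameter(Mandatory)]
        [object]$Response
    )

    # .Links is a property of the response itself, so ParsedHtml is not needed.
    foreach ($link in $Response.Links) {
        if ($link.href) { $link.href }   # skip entries without an href
    }
}

# Hypothetical helper: $true only when the response has a ParsedHtml property.
function Test-HasParsedHtml {
    param(
        [Parameter(Mandatory)]
        [object]$Response
    )

    $null -ne $Response.PSObject.Properties['ParsedHtml']
}

# Usage (needs network access):
# $req = Invoke-WebRequest -Uri $myURI
# if (Test-HasParsedHtml -Response $req) {
#     $req.ParsedHtml.getElementsByClassName($searchClass) | ForEach-Object { $_.innerHTML }
# }
# Get-LinkHref -Response $req
```

Returning the values instead of calling Write-Host keeps them on the pipeline, so they can be filtered or exported afterwards.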

Krzysztof Madej
Ben R
  • When I look up IHTMLDocument2 I only see 2 methods, write and close. Where is getElementsByClassName declared? How do I find what other methods are available to the ParsedHtml property? – silicontrip Jul 23 '20 at 05:53
  • 15
    In 2020 with PowerShell 7.0.3 this unfortunately doesn't work. The response ("$req") will not have a property called ParsedHtml. Is this a powershell-classic-only feature? – Chris Sep 30 '20 at 19:48
  • try ```$req = Invoke-Webrequest -URI $myURI -usebasicparsing``` – Ben R Mar 10 '21 at 17:41
  • 1
    @BenR "This parameter has been deprecated. Beginning with PowerShell 6.0.0, all Web requests use basic parsing only. This parameter is included for backwards compatibility only and any use of it has no effect on the operation of the cmdlet." – N. I. Nov 17 '21 at 16:04