Is it possible to do DOM traversal in a webpage saved as .mht, or saved as .htm (html only)?
Preferably in PowerShell or .NET.
The goal is to be able to do something like getElementsByTagName('div').
If yes, how?
Bjorn Mistiaen
1 Answer
Found a solution using HtmlAgilityPack.
Documentation can be found on NuDoq, which was mentioned in this post.
Example code:
# Choose a source: a local .mht/.htm file or a URL (uncomment the one you need)
$Source = 'C:\temp\myFile.mht'
# $Source = 'http://www.google.com'
# Get online or mht content
$IE = New-Object -ComObject InternetExplorer.Application
# Don't show the browser
$IE.Visible = $false
# Browse to your webpage/file
$IE.Navigate($Source)
# Wait for the page to load (ReadyState 4 = complete)
while ($IE.Busy -or $IE.ReadyState -ne 4) { Start-Sleep -Milliseconds 50 }
# Get the html from that page
$Html = $IE.Document.body.parentElement.outerHTML
# Decode to get rid of HTML-encoded characters like &amp; etc.
# System.Web is not loaded by default, so load it first
Add-Type -AssemblyName System.Web
$Html = [System.Web.HttpUtility]::HtmlDecode($Html)
# Close the browser
$IE.Quit()
# Use HtmlAgilityPack (must be installed first)
Add-Type -Path (Join-Path $Env:userprofile '.nuget\packages\htmlagilitypack\1.4.9.5\lib\Net40\HtmlAgilityPack.dll')
$Hap = New-Object HtmlAgilityPack.HtmlDocument
# Load the Html in HtmlAgilityPack to get a DOM
$Hap.LoadHtml($Html)
# Retrieve the data from the DOM (read a node)
[string]$partData = $Hap.DocumentNode.SelectSingleNode("//div[@class='formatted_content']/ul").InnerText
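
Since the question asked for something like getElementsByTagName('div'), here is a minimal sketch of how that could look with HtmlAgilityPack's Descendants method. It assumes the same DLL path as in the code above (adjust it to wherever the package is installed), and the sample markup and variable names are hypothetical, standing in for the HTML read from the saved page.

```powershell
# Assumption: HtmlAgilityPack lives at the same path used in the answer above
$HapDll = Join-Path $Env:USERPROFILE '.nuget\packages\htmlagilitypack\1.4.9.5\lib\Net40\HtmlAgilityPack.dll'
Add-Type -Path $HapDll

# Hypothetical sample markup; in practice this would be the $Html string from above
$SampleHtml = @'
<html><body>
  <div class="a">first</div>
  <p>not a div</p>
  <div class="b">second</div>
</body></html>
'@

$Doc = New-Object HtmlAgilityPack.HtmlDocument
$Doc.LoadHtml($SampleHtml)

# Descendants('div') walks the whole tree and yields every <div> node,
# which is roughly what getElementsByTagName('div') does in a browser
$Divs = @($Doc.DocumentNode.Descendants('div'))

foreach ($Div in $Divs) {
    # GetAttributeValue returns the supplied default when the attribute is missing
    $Class = $Div.GetAttributeValue('class', '')
    Write-Output ("{0}: {1}" -f $Class, $Div.InnerText)
}
```

Descendants never returns $null, so an empty result is simply an empty array. SelectNodes('//div') would also work, but it returns $null when nothing matches and needs an explicit check.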

Bjorn Mistiaen