I have a directory of similarly structured HTML files (two examples given):

File-1.html

<html>
    <body>
        <div class="foo">foo</div>
        <div class="bar"><div><p>bar</p></div></div>
        <div class="baz">baz</div>
    </body>
</html>

File-2.html

<html>
    <body>
        <div class="foo">foo</div>
        <div class="bar"><div><p>apple<br>banana</p></div></div>
        <div class="baz">baz</div>
    </body>
</html>

I am trying to create a PowerShell script that returns the contents of the bar div, stripped of all HTML:

For File-1.html: bar
For File-2.html: apple banana

I now have:

$directory = "C:\Users\Public\Documents\Sandbox\HTML"

foreach ($file in Get-ChildItem($directory))
{
    $content = Get-Content "$directory\$file"

    echo $content.ParsedHtml.getElementById("bar").innerHTML
}

This returns an error:

You cannot call a method on a null-valued expression. 
At C:\Users\Public\Documents\Sandbox\parse-html.ps1:9 char:2 
+     echo $content.ParsedHtml.getElementById("bar").innerHTML

I don't understand this error, as bar is an HTML element that exists.

What am I doing wrong?

Pr0no
  • Does $content have a value? The error tells you that the variable you are calling a method on is null. – Paul Oct 20 '14 at 12:36
  • Yes, when I do an `echo $content`, the HTML for File-1.html is returned. – Pr0no Oct 20 '14 at 12:39
  • OK, what about $content.ParsedHtml? – Paul Oct 20 '14 at 12:41
  • That value is null. I don't understand why. – Pr0no Oct 20 '14 at 12:48
  • Because $content does not have a property called ParsedHtml. PowerShell does not support parsing HTML files by default, I think. You can try using http://htmlagilitypack.codeplex.com/. Or you could just treat the content as the string it is and try to get the content of the tag with a regex. – Paul Oct 20 '14 at 12:54
  • Some help with my regex please? I'm new to regex and what I have now does not return anything: `$test = [regex]::matches($content, '(?<=
    \s+)(.*?)
    ')` I think I was doing rather well :-)
    – Pr0no Oct 20 '14 at 13:18
  • Haha :) Sorry, I'm not a regex guru myself :) The only tip I can give you is to try regex101.com; it shows syntax errors and so on. Also look here: http://stackoverflow.com/questions/11306596/regex-to-extract-the-contents-of-a-div-tag – Paul Oct 20 '14 at 13:24

2 Answers

You can try something like this:

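 # Note: the [xml] cast only works on well-formed XML, so File-2.html (unclosed <br>) will fail.
 $content = Get-Content File-1.html
 $xmlContent = [xml]$content

 # Select the div whose class attribute is 'bar'.
 $bar = $xmlContent.html.body.div | Where-Object { $_.class -eq 'bar' }

 # InnerText returns the text with all markup stripped (InnerXml would keep the inner tags).
 Write-Output $bar.InnerText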
mhatch73

You can do it like this:

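# Read the file and load it into an HTMLFile COM document (Windows only).
$text = Get-Content File-1.html
$html = New-Object -ComObject "HTMLFile"
$html.IHTMLDocument2_write($text)
# Take the first element with class 'bar'; innerText is its text without markup.
$bar = $html.body.getElementsByClassName('bar')[0]
$bar.innerText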
Carsten