I have a directory of similar structured HTML files (two examples given):
File-1.html
<html>
<body>
<div class="foo">foo</div>
<div class="bar"><div><p>bar</p></div></div>
<div class="baz">baz</div>
</body>
</html>
File-2.html
<html>
<body>
<div class="foo">foo</div>
<div class="bar"><div><p>apple<br>banana</p></div></div>
<div class="baz">baz</div>
</body>
</html>
I am trying to create a Powershell script to return the contents of the bar
div, stripped from all html:
For File-1.html: bar
For File-2.html: apple banana
I now have:
$directory = "C:\Users\Public\Documents\Sandbox\HTML"
foreach ($file in Get-ChildItem($directory))
{
$content = Get-Content $file.fullname
$test = [regex]::matches($content, '(?i)<div class="bar">(.*)</div>')
echo $test[0]
}
This returns however <div class="bar"><div><p>bar</p></div></div><div class="baz">baz</div>
. In other words, the regex does not stop until the last </div>
. How can I let it only grab what in the <div class="bar">
div?