I have a directory of similarly structured HTML files (two examples given):

File-1.html

<html>
    <body>
        <div class="foo">foo</div>
        <div class="bar"><div><p>bar</p></div></div>
        <div class="baz">baz</div>
    </body>
</html>

File-2.html

<html>
    <body>
        <div class="foo">foo</div>
        <div class="bar"><div><p>apple<br>banana</p></div></div>
        <div class="baz">baz</div>
    </body>
</html>

I am trying to create a PowerShell script that returns the contents of the bar div, stripped of all HTML:

For File-1.html: bar

For File-2.html: apple banana

I now have:

$directory = "C:\Users\Public\Documents\Sandbox\HTML"

foreach ($file in Get-ChildItem($directory))
{
    $content = Get-Content $file.fullname

    $test = [regex]::matches($content, '(?i)<div class="bar">(.*)</div>')

    echo $test[0]
}

However, this returns <div class="bar"><div><p>bar</p></div></div><div class="baz">baz</div>. In other words, the regex does not stop until the last </div>. How can I make it grab only what is inside the <div class="bar"> div?

Pr0no

1 Answer

By default, quantifiers are greedy: they match as much as possible while still allowing the remainder of the regular expression to match. Use *? for a non-greedy match, meaning "zero or more, preferably as few as possible". The s modifier in the pattern below additionally lets . match line breaks, in case the div spans several lines.

(?si)<div class="bar">(.*?)</div>
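As a rough sketch of how this pattern could be applied to get the plain text the question asks for: the helper name Get-BarText and the tag-stripping step are assumptions added for illustration, not part of the original answer. It relies on the inner div closing right before the outer one, as in the two sample files, and is not a general HTML parser.

```powershell
# Hypothetical helper (not from the original post): returns the text of the
# first <div class="bar"> in an HTML string, with all tags removed.
function Get-BarText {
    param([string]$Html)

    # (?s) lets . match newlines, (?i) ignores case;
    # .*? is non-greedy, so the match ends at the first </div>.
    $match = [regex]::Match($Html, '(?si)<div class="bar">(.*?)</div>')
    if (-not $match.Success) { return $null }

    # Replace each remaining tag (<div>, <p>, <br>, ...) with a space,
    # then collapse runs of whitespace and trim the ends.
    $text = $match.Groups[1].Value -replace '<[^>]+>', ' '
    return ($text -replace '\s+', ' ').Trim()
}

# Inline copies of the two sample documents from the question.
$html1 = '<div class="foo">foo</div><div class="bar"><div><p>bar</p></div></div><div class="baz">baz</div>'
$html2 = '<div class="foo">foo</div><div class="bar"><div><p>apple<br>banana</p></div></div><div class="baz">baz</div>'

Get-BarText $html1   # bar
Get-BarText $html2   # apple banana
```

Inside the loop from the question, the file could be read as a single string with Get-Content $file.FullName -Raw (available from PowerShell 3.0) and passed to the helper.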
hwnd