Consider the following input HTML:

<div class='content'>
    <img style='border-style: solid; border-width: 1px;' src='/media/uploads/defaults/181'/><br/><br/>
    <div class='imgCaption'>
        Reverse Osmosis Caption
    </div>
</div>
<pagebreak/>
<h3>Access </h3>
<h4>Type</h4>
<div class='content'>
    Your plumbing system is accessible with a Main Shut off Valves
</div>
<h4>Location</h4>
<pagebreak/>
<h3>Operation & maintenance #1</h3>
<div class='content'>
    All wastewater treatment systems and their components require regular maintenance.
</div>
<h4>Activity</h4>

I need to find all h4 headers that are not followed by a div with class "content". (In this example, that is <h4>Activity</h4> at the very bottom.)

My regex

/<h4>.*<\/h4>(?!<div class='content'>)/

captures everything after

<h4>Type</h4>

That makes sense, since what follows the header is not just <div class='content'>.
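The behaviour described above can be reproduced with a small script. This is a minimal Python sketch, not the asker's actual setup: the question does not state the regex flavor or flags, so both the default mode and a DOTALL mode (where `.` also matches newlines) are shown. The `html` string is an abbreviated copy of the sample input.

```python
import re

# Abbreviated copy of the sample input from the question (assumption: the
# omitted leading block does not affect which headers match).
html = """<h3>Access </h3>
<h4>Type</h4>
<div class='content'>
    Your plumbing system is accessible with a Main Shut off Valves
</div>
<h4>Location</h4>
<pagebreak/>
<h3>Operation & maintenance #1</h3>
<div class='content'>
    All wastewater treatment systems and their components require regular maintenance.
</div>
<h4>Activity</h4>
"""

original = r"<h4>.*</h4>(?!<div class='content'>)"

# Default mode: "." stops at newlines, so each header matches on its own.
# The lookahead sees a newline right after </h4>, never the <div> itself,
# so it rejects nothing.
per_line = re.findall(original, html)

# DOTALL mode: the greedy ".*" runs from the first <h4> to the last </h4>,
# producing one long match that starts at <h4>Type</h4>.
spanning = re.findall(original, html, flags=re.DOTALL)

print(per_line)
print(len(spanning), spanning[0][:13], "...", spanning[0][-17:])
```

In both modes the lookahead never gets a chance to reject a header, because the character right after `</h4>` is a newline rather than `<div class='content'>`.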

So my question is: how can I rewrite the regex so that it only picks up the headers that are not followed by a div with class "content"?

gsamaras
Maxim Pak

1 Answer

You need to add .*? at the start of the negative lookahead assertion. Without .*?, the negative lookahead only checks whether the <div class='content'> tag follows immediately after </h4>.

<h4>(?:(?!<\/?h4>).)*?<\/h4>(?!.*?<div class='content'>)

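It matches only the last h4 tag, because that is the only one with no <div class='content'> tag anywhere after it. Since the input spans several lines, . must also match newlines (the s or DOTALL modifier in most regex flavors) for the lookahead to see the following lines.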
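As a rough check of the pattern above, here is a minimal Python sketch (Python is an assumption; the question does not name a regex flavor). It applies the pattern with `re.DOTALL` to an abbreviated copy of the sample input. The second pattern, `strict`, is not part of the answer: it is a hypothetical variant that looks only at what comes directly after the header, skipping whitespace, to show how the two readings of "not followed by" differ.

```python
import re

# Abbreviated copy of the sample input from the question (assumption: the
# omitted leading block does not affect which headers match).
html = """<h3>Access </h3>
<h4>Type</h4>
<div class='content'>
    Your plumbing system is accessible with a Main Shut off Valves
</div>
<h4>Location</h4>
<pagebreak/>
<h3>Operation & maintenance #1</h3>
<div class='content'>
    All wastewater treatment systems and their components require regular maintenance.
</div>
<h4>Activity</h4>
"""

# Pattern from the answer. The tempered dot (?:(?!</?h4>).)*? keeps each
# match inside a single <h4>...</h4> pair; DOTALL lets ".*?" in the
# lookahead scan across the following lines.
anywhere = re.compile(
    r"<h4>(?:(?!</?h4>).)*?</h4>(?!.*?<div class='content'>)", re.DOTALL)

# Hypothetical variant (not from the answer): reject a header only when the
# content div comes directly after it, ignoring whitespace.
strict = re.compile(
    r"<h4>(?:(?!</?h4>).)*?</h4>(?!\s*<div class='content'>)", re.DOTALL)

print(anywhere.findall(html))  # headers with no content div anywhere later
print(strict.findall(html))    # headers whose next element is not a content div
```

On this sample, `anywhere` yields only the Activity header, while `strict` also yields Location, because that header is followed by `<pagebreak/>` rather than a content div. Which reading is wanted depends on the data. For anything beyond simple, regular markup, an HTML parser is usually more robust than a regex.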

Avinash Raj