What is the most efficient expression for matching the content between an outer pair of delimiters when similar, lazily quantified pairs are nested inside it and the text spans multiple lines?

I’m using HTML as the test subject because it is easier to comprehend than the true format (symbols and character bytes); if this were really HTML, I would be using XPath/DOM!

Sample data:

<div>
     Testing 1234 
     <div>Testing1234</div> and testing 
     <div>Testing1234</div> testing 1234
</div>

Desired result:

Testing 1234 
         <div>Testing1234</div> and testing 
         <div>Testing1234</div> testing 1234

PCRE Expressions

Base: /(<div>)(.*?)(<\/div>)/
Non capturing group: /(<div>)((?:<div>.*?<\/div>).*?)(<\/div>)/
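
To make the failure of the base expression concrete, here is a minimal sketch using Python's built-in `re` module as a stand-in for PCRE (an assumption for illustration; the variable names `sample`, `base`, `no_dotall` and `dotall` are hypothetical). It only shows where the lazy quantifier stops on the sample data.

```python
import re

# The sample data from the question, one string per line.
sample = "\n".join([
    "<div>",
    "     Testing 1234 ",
    "     <div>Testing1234</div> and testing ",
    "     <div>Testing1234</div> testing 1234",
    "</div>",
])

# The base expression from the question (the slash needs no escaping here).
base = r"(<div>)(.*?)(</div>)"

# Without DOTALL, "." cannot cross a newline, so the outer <div> never
# matches and the first hit is an inner pair.
no_dotall = re.search(base, sample)
print(repr(no_dotall.group(2)))

# With DOTALL, the match starts at the outer <div>, but the lazy ".*?"
# stops at the first </div> it reaches, which closes the first inner pair.
dotall = re.search(base, sample, re.DOTALL)
print(repr(dotall.group(2)))
```

In both cases the second group ends too early, because a lazy quantifier has no notion of nesting depth.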
Michael Mikhjian
  • 4
    Don't use regex to parse HTML – anubhava Jul 30 '23 at 17:08
  • 1
    Are you stuck with regex or is it possible for you to use xpath? – Monofuse Jul 30 '23 at 17:54
  • https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Danny Fardy Jhonston Bermúdez Jul 30 '23 at 18:10
  • @anubhava I’m not parsing HTML; the test subject is written in HTML only because I figured that makes the problem easier to read and comprehend. – Michael Mikhjian Jul 30 '23 at 18:30
  • 1
    @Monofuse I added more detail above. This is not a case for an XPath solution. – Michael Mikhjian Jul 30 '23 at 18:33
  • 1
    You need a [recursive regex](https://www.regular-expressions.info/recurse.html) to match nested: [`(?s)<div>((?:(?:(?!<\/?div).)+|(?R))*+)<\/div>`](https://regex101.com/r/IYnp1p/1) (this sample regex will only work if the data is like your sample; the result will be the [capture](https://www.regular-expressions.info/brackets.html) of the *first group*)
    – bobble bubble Jul 30 '23 at 18:38
  • 1
    Just to mention, it will be more efficient if you use a [negated](https://www.regular-expressions.info/charclass.html#negated) `<` instead of the `(?:(?!<\/?div).)+` part: [`<div>((?:[^<]+|<(?!\/?div\b)|(?R))*+)<\/div>`](https://regex101.com/r/RCkzW6/2) (we can also drop the dotall flag `(?s)` here as there is no dot used).
    – bobble bubble Jul 30 '23 at 18:52
  • @bobblebubble awesome - my answer has been found, thanks! – Michael Mikhjian Jul 30 '23 at 20:28
  • 1
    @MichaelMikhjian You're welcome, glad that helped! – bobble bubble Jul 31 '23 at 05:42

1 Answer

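This requires a recursive regex to match the nested pairs. (Answer provided by @bobble bubble in the comments.)

The following matches the outermost pair, with the desired content captured in the first group. It will only work if the data is shaped like the sample: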

(?s)<div>((?:(?:(?!<\/?div).)+|(?R))*+)<\/div>

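It is more efficient to use a negated character class, [^<]+, in place of the (?:(?!<\/?div).)+ part. The dotall flag (?s) can then be dropped as well, since no dot is used:

<div>((?:[^<]+|<(?!\/?div\b)|(?R))*+)<\/div>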
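
For environments whose regex engine lacks recursion (Python's built-in `re` has no `(?R)`), the balancing that the recursive patterns perform can be sketched with an explicit depth counter instead. This is a swapped-in technique, not the recursive regex itself, and it assumes well-formed input with bare `<div>` and `</div>` delimiters like the sample; `TAG` and `outer_content` are hypothetical names.

```python
import re

# Hypothetical token pattern: an opening or closing delimiter, no attributes.
TAG = re.compile(r"<(/?)div>")

def outer_content(text):
    """Return the text between the first <div> and its matching </div>, or None."""
    depth = 0
    start = None
    for m in TAG.finditer(text):
        if m.group(1) == "":
            # Opening delimiter: remember where the outermost content begins.
            if depth == 0:
                start = m.end()
            depth += 1
        else:
            # Closing delimiter: ignore a stray close before any open.
            if depth == 0:
                continue
            depth -= 1
            if depth == 0:
                return text[start:m.start()]
    return None  # no balanced outer pair found

sample = "\n".join([
    "<div>",
    "     Testing 1234 ",
    "     <div>Testing1234</div> and testing ",
    "     <div>Testing1234</div> testing 1234",
    "</div>",
])

print(outer_content(sample))
```

On the sample this returns the same text as the first capture group of the recursive patterns. Unlike the regex, it does not retry from a later position when the first opening delimiter is never closed.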
Michael Mikhjian