
I have some HTML which I want to grab between two tags. However, nested tags exist in the HTML, so looking for the closing </div> wouldn't work, as the match would end at the first nested div.

Basically, I want my regex to:

Match some text literally, followed by ANY character up to another literal text string. So my question is: how do I get [^<]* to continue matching until it sees the next div?

Such as: <div id="test"[^<]*<div id="test2"

Example html

<div id="test" class="whatever">  
   <div class="wrapper">
   <fieldset>Test</fieldset><div class="testclass">some info</div>
   </div>
  <!-- end test div--></div>

</div>
 <div id="test2" class="endFind">
kate_h
  • What language are you using? Regexes vary in format depending on the language. – Stefan Jan 05 '12 at 03:35
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – David Foerster Jan 11 '17 at 10:56

2 Answers


In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.

For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".

But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div

Read up on greedy vs. lazy quantifiers, and check that whatever language you're using supports them.

$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$ 
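
For the question's actual goal (everything from one literal marker up to the next), the same lazy idea applies, with one caveat: `.` normally doesn't match newlines, so multi-line HTML needs the "dot matches all" flag (`s` in PCRE, `re.DOTALL` in Python). Here's a rough sketch in Python, chosen only because the question doesn't name a language. The `html` string is a made-up, shortened version of the question's markup, with a second `test2` div added to show the difference.

```python
import re

# Hypothetical, shortened version of the markup from the question,
# with an extra 'test2' div appended so lazy and greedy differ.
html = (
    '<div id="test" class="whatever">\n'
    '  <div class="wrapper"><fieldset>Test</fieldset></div>\n'
    '</div>\n'
    '<div id="test2" class="endFind">\n'
    '<div id="test2" class="again">'
)

# Lazy: stop at the FIRST '<div id="test2"'. DOTALL lets '.' cross newlines.
lazy = re.search(r'<div id="test"(.*?)<div id="test2"', html, re.DOTALL)

# Greedy: run on to the LAST '<div id="test2"' in the input.
greedy = re.search(r'<div id="test"(.*)<div id="test2"', html, re.DOTALL)

# Without DOTALL, '.' cannot cross a newline, so nothing matches here.
no_flag = re.search(r'<div id="test"(.*?)<div id="test2"', html)

print(len(lazy.group(1)) < len(greedy.group(1)))  # lazy capture is shorter
print(no_flag is None)
```

This still treats the HTML as plain text, so it only works as long as the two literal markers appear exactly as written.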
ghoti

Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use an HTML/XML parser.
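
To give an idea of what the parser route looks like, here's a minimal sketch using Python's standard `html.parser` module (Python is an assumption, since the question doesn't name a language). The class name `DivGrabber` and the sample `html` string are made up for illustration. It tracks how deeply nested it is inside the target div, so nested divs don't end the capture early.

```python
from html.parser import HTMLParser


class DivGrabber(HTMLParser):
    """Collects the text inside the <div> with a given id, nested tags included."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # how many <div>s deep we are, counting the target itself
        self.chunks = []    # text pieces found inside the target

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1                      # nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1                       # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1                      # reaching 0 means the target closed

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical sample modelled on the question's markup.
html = (
    '<div id="test" class="whatever">'
    '<div class="wrapper"><fieldset>Test</fieldset>'
    '<div class="testclass">some info</div></div>'
    '</div>'
    '<div id="test2" class="endFind">not wanted</div>'
)

parser = DivGrabber("test")
parser.feed(html)
parser.close()
text = " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())
print(text)  # Test some info
```

This version only collects text. If you need the raw markup instead, you'd have to record the tags as well, or use a DOM-style parser.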

If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
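
If the identifying marks are plain literal strings, you don't even need a regex; ordinary string searching will do. A rough sketch in Python (an assumption, since the question doesn't name a language), where `between` is a hypothetical helper and `page` is a made-up sample:

```python
def between(page, start_mark, end_mark):
    """Return the text between two literal markers, or None if either is missing."""
    start = page.find(start_mark)
    if start == -1:
        return None
    start += len(start_mark)            # skip past the opening marker
    end = page.find(end_mark, start)    # first end marker after the start
    if end == -1:
        return None
    return page[start:end]


# Hypothetical sample modelled on the question's markup.
page = (
    '<div id="test" class="whatever">\n'
    '  <div class="wrapper">some info</div>\n'
    '</div>\n'
    '<div id="test2" class="endFind">'
)

inner = between(page, '<div id="test"', '<div id="test2"')
print(inner)
```

Like any scraping shortcut, this breaks as soon as the page changes the markers, for example by reordering attributes.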

Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the html isn't valid, or I missed something, etc, etc ...

/<div id="test"[^<]*(<([^ >]+).+<\/\2>[^<]*)*<\/div>/