Regexp to get content until next div only (not containing div)

Question

I have the following input

<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>

I know title1 and title2 and I want to collect content1 and content2

I would need something like this:

<div style="s1">title1</div>.*?<div style="s1">(.*?)</div>

but since regexp is greedy, it matches until the end so it returns

content1</div>
    <div style="s1">title2</div>
    <div style="s1">content2

I would like to add to the pattern a list of tags that should not be included in the match.

Something like:

<div style="s1">title1</div>.*?<div style="s1">(.*?[^<div])</div>

where [^<div] is meant to express "does not contain <div". It should allow multiple options, probably combined with |

How can I do it?

score 4 · Accepted Answer · edited May 23 '17 at 12:20

Obligatory link.

Now that that is out of the way, just do some DOM manipulation and XPath:

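    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // $html holds the markup; @ hides warnings about imperfect HTML
    $x = new DOMXPath($dom);

    $content = array();
    foreach($x->query("//div") as $node)
    {
       if (trim($node->textContent) == 'title1')
       {
           // nextSibling can be a whitespace text node, so ask for the next div sibling instead
           $next = $x->query("following-sibling::div[1]", $node)->item(0);
           if ($next !== null)
           {
               $content['title1'] = trim($next->textContent);
           }
       }
    }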

Now wasn't that easy? So no more regexing HTML, okay?
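Since the question mentions two known titles, here is a minimal sketch of the same DOM/XPath idea wrapped in a function that handles a list of titles. The function name `collectContentByTitle` and the `$titles` parameter are made up for illustration, and it assumes each content div is the next div sibling of its title div, as in the sample input.

```php
<?php
// Hypothetical helper: returns [title => content] for each known title.
// Assumes the content <div> is the next <div> sibling of the title <div>.
function collectContentByTitle(string $html, array $titles): array
{
    $dom = new DOMDocument();
    // Suppress the warnings libxml raises on imperfect markup.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $content = [];
    foreach ($xpath->query('//div') as $div) {
        $title = trim($div->textContent);
        if (!in_array($title, $titles, true)) {
            continue;
        }
        // following-sibling::div[1] skips whitespace text nodes between the divs.
        $next = $xpath->query('following-sibling::div[1]', $div)->item(0);
        if ($next !== null) {
            $content[$title] = trim($next->textContent);
        }
    }
    return $content;
}

$html = '<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>';

$result = collectContentByTitle($html, ['title1', 'title2']);
print_r($result);
```

With the sample input from the question this should give `title1 => content1` and `title2 => content2`.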

edited May 23 '17 at 12:20

Community

answered Feb 03 '11 at 22:21

Byron Whitlock

+1 true story... I know regex gives people magical warm fuzzy feelings but it is terrible for parsing DOM. – CrayonViolent Feb 03 '11 at 22:23

score 0 · Answer 2 · answered Feb 03 '11 at 22:19

<div style="s1">title1</div>.*<div style="s1">(([^<]|<[^\/])*)</div>

Try this. The capture group matches any character other than <, or a < that is not followed by /. If you want, I can add a condition for sub-divs etc.
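A rough sketch of how that capture group could be used from PHP with `preg_match`. Note two deliberate changes that are my assumptions, not part of the pattern above: the greedy `.*` between the two divs is swapped for `\s*` (so only whitespace may sit between the title div and the content div), and `~` is used as the delimiter so the slashes need no escaping.

```php
<?php
$html = '<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>';

$title = 'title1'; // the known title to look for

// (?:[^<]|<[^/])* consumes any character except "<", or a "<" not followed by "/",
// so the capture stops right before the first closing tag.
$pattern = '~<div style="s1">' . preg_quote($title, '~') . '</div>\s*'
         . '<div style="s1">((?:[^<]|<[^/])*)</div>~';

$captured = null;
if (preg_match($pattern, $html, $m)) {
    $captured = $m[1];
    echo $captured, "\n";
}
```

This only holds up while the markup stays exactly this regular. Extra attributes or nested closing tags inside the content div will break it.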

answered Feb 03 '11 at 22:19

SergeS

score 0 · Answer 3 · answered Feb 03 '11 at 22:20

Just use the U (ungreedy) modifier: http://.php.net/manual/fr/reference.pcre.pattern.modifiers.php
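For illustration, a small sketch of what the U modifier does, assuming the sample input from the question. U inverts greediness, so a plain `.*` becomes lazy (and `.*?` would become greedy). The `s` modifier is my addition so that `.` also matches the newline between the divs.

```php
<?php
$html = '<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>';

// U: quantifiers are lazy by default. s: "." also matches newlines.
$pattern = '~<div style="s1">title1</div>.*<div style="s1">(.*)</div>~sU';

$captured = null;
if (preg_match($pattern, $html, $m)) {
    $captured = $m[1];
    echo $captured, "\n";
}
```

Because both `.*` are lazy here, the match stops at the first closing `</div>` after the div that follows title1.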

answered Feb 03 '11 at 22:20

soju
