0

I have the following input:

<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>

I know title1 and title2, and I want to collect content1 and content2.

I would need something like this:

<div style="s1">title1</div>.*?<div style="s1">(.*?)</div>

but since the regex is greedy, it matches up to the last closing tag, so it returns:

content1</div>
    <div style="s1">title2</div>
    <div style="s1">content2

I would like to add to the pattern a list of tags that should not be included in the match.

Something like:

<div style="s1">title1</div>.*?<div style="s1">(.*?[^<div])</div>

where [^<div] is meant to say "does not contain <div". It should accept several alternatives, probably combined with |.

How can I do it?

Pentium10

3 Answers

4

Obligatory link.

Now that that is out of the way, just do some DOM manipulation and XPath:

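    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ hides warnings about malformed markup
    $x = new DOMXPath($dom);

    $content = array();
    foreach($x->query("//div") as $node)
    {
       if (trim($node->textContent) == 'title1')
       {
           // nextSibling can be a whitespace text node, so ask for the next <div> explicitly
           $next = $x->query("following-sibling::div[1]", $node)->item(0);
           if ($next !== null)
           {
               $content['title1'] = trim($next->textContent);
           }
       }
    }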

Now wasn't that easy? So no more regexing HTML, okay?
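
The same idea extends to several known titles at once. What follows is only a sketch: the helper name `collectContentByTitle` and its signature are made up for illustration, and it assumes each content div is the next `<div>` sibling of its title div, as in the question's input.

```php
<?php
// Hypothetical helper (not from the answer above): maps each known title
// to the text of the <div> that directly follows it.
function collectContentByTitle(string $html, array $titles): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $content = [];
    foreach ($xpath->query('//div') as $node) {
        $title = trim($node->textContent);
        if (!in_array($title, $titles, true)) {
            continue; // not one of the titles we are looking for
        }
        // following-sibling::div[1] skips whitespace text nodes between the divs.
        $next = $xpath->query('following-sibling::div[1]', $node)->item(0);
        if ($next !== null) {
            $content[$title] = trim($next->textContent);
        }
    }
    return $content;
}

$html = '<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>';

$result = collectContentByTitle($html, ['title1', 'title2']);
print_r($result);
```

For the question's input this should give an array with content1 under the key title1 and content2 under title2.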

Community
Byron Whitlock
0

<div style="s1">title1</div>\s*<div style="s1">(([^<]|<[^\/])*)</div>

Try this: the capture group matches any character other than <, or a < that is not followed by /, so it cannot run past a closing tag. If you want, I can add a condition for nested divs and so on.
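
The "must not contain these tags" idea from the question can also be written with a negative lookahead. The sketch below is one possible form, not the pattern from this answer: the function name `contentAfterTitle` is hypothetical, and the tag list `div|p` is an arbitrary example of combining alternatives with |.

```php
<?php
// Hypothetical helper: returns the text of the div that follows the given
// title div, or null when the pattern does not match.
function contentAfterTitle(string $html, string $title): ?string
{
    // (?:(?!</?(?:div|p)\b).)* consumes one character at a time, but only
    // while no opening or closing <div> or <p> tag starts at that position.
    $pattern = '~<div style="s1">' . preg_quote($title, '~') . '</div>\s*'
             . '<div style="s1">((?:(?!</?(?:div|p)\b).)*)</div>~s';

    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

$html = '<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>';

$first  = contentAfterTitle($html, 'title1');
$second = contentAfterTitle($html, 'title2');
var_dump($first, $second);
```

Because the lookahead refuses to step over any listed tag, the capture stays inside a single div even with the s modifier switched on.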

SergeS
0

Just use the U (ungreedy) modifier: http://php.net/manual/fr/reference.pcre.pattern.modifiers.php
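
A small comparison may make the effect of U clearer. This is a sketch using the question's input; the variable names are illustrative, and the s modifier is an added assumption so that the dot can cross the line breaks between the divs.

```php
<?php
$html = '<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>';

// Greedy: with the s modifier, (.*) runs on to the last </div> in the input.
preg_match('~<div style="s1">title1</div>\s*<div style="s1">(.*)</div>~s', $html, $greedy);

// Same pattern plus U: quantifiers become lazy by default, so (.*) stops
// at the first </div> it can.
preg_match('~<div style="s1">title1</div>\s*<div style="s1">(.*)</div>~sU', $html, $ungreedy);

var_dump($greedy[1]);
var_dump($ungreedy[1]);
```

Note that U inverts every quantifier in the pattern, so a `?` suffix such as `.*?` becomes greedy again under U. Mixing the two styles in one pattern is easy to get wrong.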

soju