I have a simple question for regex gurus. And yes... I did try several different variations of the regex before posting here. Forgive my regex ignorance. This is targeting PHP.

I have the following HTML:

<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>

What I tried that seemed most likely to work:

 preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>(.*)<br \/>/', $haystack, $result);

The above returns nothing.

So then I tried this and I got the first group to match, but I have not been able to get the second.

preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>/', $haystack, $result);

Thank you!

a432511

3 Answers

2

Regex is great, but some things are best tackled with a parser, and markup is one of them.

Instead of using regex, I'd use an HTML parser, like http://simplehtmldom.sourceforge.net/

However, if you insist on using regex for this specific case, you can use this pattern:

if (preg_match('%</h4>(\\r?\\n)\\s+(.*?)(<br />)(.*?)(<br />)%', $subject, $regs)) {
    $first_text_string = $regs[2];
    $second_text_string = $regs[4];
} else {
    //pattern not found
}
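
Since the sample HTML repeats the same `<div>` block, the same pattern can also be run with `preg_match_all` to collect every pair rather than only the first. This is a minimal sketch, assuming the markup looks exactly like the question's sample; `$subject` holds a shortened two-block copy of it, and `$pairs` is a name made up for illustration.

```php
<?php
// Sample input modelled on the question's HTML (shortened to two blocks).
$subject = <<<'HTML'
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
HTML;

// Same pattern as in the snippet above; groups 2 and 4 hold the two strings.
$pattern = '%</h4>(\\r?\\n)\\s+(.*?)(<br />)(.*?)(<br />)%';

$pairs = [];
// PREG_SET_ORDER yields one array per match, so each <div> is one entry.
if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $pairs[] = [$m[2], $m[4]];
    }
}

print_r($pairs);
```

Each entry of `$pairs` is one `<div>`'s two text strings, in document order.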
Homer6
  • A comparative list of alternatives to `simplehtmldom` (which can be quite slow and cumbersome) [can be found here](http://stackoverflow.com/a/3577662/358679) – Wrikken Sep 24 '13 at 00:50
  • FYI, I also recommend RegexBuddy, as I've mentioned previously in this post: http://stackoverflow.com/a/18132398/278976 – Homer6 Sep 24 '13 at 00:54
0

This will do what you want given the exact input you provided. If you need something more generic, please let me know.

(.*)<br\s*\/>(.*)<br\s*\/>

See here for a live demo http://www.phpliveregex.com/p/1i3
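
For reference, here is a rough sketch of how that pattern might be used from PHP. Because `.` does not match newlines by default, the first group starts at the beginning of the line and so picks up the indentation, which is why the captures are trimmed here. The variable names `$haystack` and `$rows` and the shortened sample input are assumptions made for illustration.

```php
<?php
// One block from the question's sample HTML.
$haystack = <<<'HTML'
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
HTML;

// The pattern from above, wrapped in / delimiters.
$pattern = '/(.*)<br\s*\/>(.*)<br\s*\/>/';

$rows = [];
if (preg_match_all($pattern, $haystack, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Group 1 includes the leading indentation of the line, so trim it.
        $rows[] = [trim($m[1]), trim($m[2])];
    }
}

print_r($rows);
```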

0

I highly recommend using DOM and XPath for this.

$doc = new DOMDocument;
@$doc->loadHTML($html);

$xp = new DOMXPath($doc);

// <br /> is parsed as an element, so each piece of text is its own text node;
// the predicate skips whitespace-only nodes.
foreach ($xp->query('//div/text()[normalize-space()]') as $n) {
   echo trim($n->nodeValue) . "\n";
}
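
If the two strings need to stay grouped per `<div>`, one option is to run a second, relative XPath query for each `<div>`. This is only a sketch under the assumption that the wanted text always sits directly inside the `<div>`; the `$groups` name and the shortened sample in `$html` are made up for illustration.

```php
<?php
// Two blocks modelled on the question's sample HTML.
$html = <<<'HTML'
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
HTML;

$doc = new DOMDocument;
// Suppress any libxml parse warnings, as in the snippet above.
@$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$groups = [];
foreach ($xp->query('//div') as $div) {
    $lines = [];
    // Direct text-node children of this <div> that are not just whitespace.
    // The link text lives inside <h4><a>, so it is not selected.
    foreach ($xp->query('text()[normalize-space()]', $div) as $text) {
        $lines[] = trim($text->nodeValue);
    }
    $groups[] = $lines;
}

print_r($groups);
```

Each entry of `$groups` then corresponds to one `<div>`, which avoids having to pair up a flat list of strings afterwards.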

But if you still decide to take the regex route, this will work for you.

preg_match_all('#</h4>\s*([^<]+)<br />([^<]+)#', $str, $matches);
hwnd