HTML Regex to Extract Data

Question

I have a simple question for regex gurus. And yes... I did try several different variations of the regex before posting here. Forgive my regex ignorance. This is targeting PHP.

I have the following HTML:

<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>

What I tried that seemed most likely to work:

 preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>(.*)<br \/>/', $haystack, $result);

The above returns nothing.

So then I tried this and I got the first group to match, but I have not been able to get the second.

preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>/', $haystack, $result);

Thank you!

possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Dai, Sep 24 '13 at 00:41
`.*` won't match newlines without [the `/s` modifier](http://us2.php.net/manual/en/reference.pcre.pattern.modifiers.php). — quietmint, Sep 24 '13 at 00:42
@user113215 /s worked to get the first match, but the lines repeat. It's only retrieving the first instance. — a432511, Sep 24 '13 at 01:12
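
To illustrate the two comments above, here is a minimal sketch. It does not use the asker's exact pattern: the simplified patterns and the names `$block`, `$greedy` and `$lazy` are assumptions made for this example. It suggests why `/s` alone returns a single result: once `.` can cross newlines, a greedy `.*` runs through all three blocks, whereas a lazy `.*?` stops at each block.

```php
<?php
// Sample input rebuilt from the question: the same <div> block three times.
$block = <<<HTML
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>

HTML;
$haystack = str_repeat($block, 3);

// Greedy .* with /s: the dot crosses newlines, so the first .* runs to the
// LAST </h4> and the three blocks collapse into a single match.
$greedy = preg_match_all('/<h4>.*<\/h4>(.*)<br \/>/s', $haystack, $g);

// Lazy .*? stops at the first position where the rest of the pattern fits,
// so every block yields its own match; \s* skips the indentation.
$lazy = preg_match_all('/<h4>.*?<\/h4>\s*(.*?)<br \/>(.*?)<br \/>/s', $haystack, $l);

echo "greedy matches: $greedy\n"; // 1
echo "lazy matches:   $lazy\n";   // 3
echo $l[1][0] . ' | ' . $l[2][0] . "\n";
```

The lazy version still depends on the markup keeping this exact shape, which is the usual argument for a parser.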

score 2 · Answer 1 · answered Sep 24 '13 at 00:45

Regex is great. But, some things are best tackled with a parser. Markup is one such example.

Instead of using regex, I'd use an HTML parser, like http://simplehtmldom.sourceforge.net/

However, if you insist on using regex for this specific case, you can use this pattern:

if (preg_match('%</h4>(\\r?\\n)\\s+(.*?)(<br />)(.*?)(<br />)%', $subject, $regs)) {
    $first_text_string = $regs[2];
    $second_text_string = $regs[4];
} else {
    //pattern not found
}
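
Since `preg_match` stops at the first hit, a sketch of the same pattern run through `preg_match_all` may help with the repeated `<div>` blocks. The input is rebuilt from the question, and `$block`, `$sets` and `$pairs` are names made up for this example.

```php
<?php
// Sample input rebuilt from the question: the same <div> block three times.
$block = <<<HTML
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>

HTML;
$subject = str_repeat($block, 3);

// Same pattern as above. PREG_SET_ORDER returns one array per match,
// each holding that match's capture groups.
$pattern = '%</h4>(\\r?\\n)\\s+(.*?)(<br />)(.*?)(<br />)%';
preg_match_all($pattern, $subject, $sets, PREG_SET_ORDER);

$pairs = [];
foreach ($sets as $set) {
    // Group 2 is the text before the first <br />, group 4 the text after it.
    $pairs[] = [$set[2], $set[4]];
}

print_r($pairs);
```

Each entry of `$pairs` then holds the two strings from one `<div>`.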

Homer6

A comparative list of alternatives to `simplehtmldom` (which can be quite slow and cumbersome) [can be found here](http://stackoverflow.com/a/3577662/358679) – Wrikken Sep 24 '13 at 00:50
FYI, I also recommend RegexBuddy, as I've mentioned previously in this post: http://stackoverflow.com/a/18132398/278976 – Homer6 Sep 24 '13 at 00:54

score 0 · Answer 2 · answered Sep 24 '13 at 01:00

This will do what you want given the exact input you provided. If you need something more generic please let me know.

(.*)<br\s*\/>(.*)<br\s*\/>

See here for a live demo http://www.phpliveregex.com/p/1i3

Timothy Huertas

hwnd · Accepted Answer · 2015-08-03T19:33:49.970

0

I highly recommend using DOM and XPath for this.

$doc = new DOMDocument;
@$doc->loadHTML($html); 

$xp = new DOMXPath($doc);

// Text nodes never contain the <br /> markup itself, so select each non-empty text node directly.
foreach ($xp->query('//div/text()[normalize-space()]') as $n) {
   echo trim($n->nodeValue) . "\n";
}

But if you still decide to take the regex route, this will work for the markup shown:

preg_match_all('#</h4>\s*([^<]+)<br />([^<]+)#', $str, $matches);
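
One possible way to read the result, assuming `$str` holds the HTML from the question (rebuilt below). With its default flags, `preg_match_all` fills `$matches[1]` with every first capture and `$matches[2]` with every second one; `$block` and `$rows` are hypothetical names used only in this sketch.

```php
<?php
// Sample input rebuilt from the question: the same <div> block three times.
$block = <<<HTML
<div>
    <h4>
        <a href="somelink.html">some text blah</a>
    </h4>
    I need this text<br />I need this text too.<br />
</div>

HTML;
$str = str_repeat($block, 3);

// Pattern from the answer: \s* skips the newline and indentation after </h4>,
// and [^<]+ captures text up to the next tag.
$count = preg_match_all('#</h4>\s*([^<]+)<br />([^<]+)#', $str, $matches);

// array_map with a null callback zips the two capture lists into pairs.
$rows = array_map(null, $matches[1], $matches[2]);

foreach ($rows as [$before, $after]) {
    echo $before . "\n" . $after . "\n";
}
```

Because `[^<]+` cannot cross a tag, the pattern needs no `/s` modifier and no lazy quantifier.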

edited Aug 03 '15 at 19:33

answered Sep 24 '13 at 02:13

hwnd

This worked as advertised. The others would not catch repeating groups. Thanks! – a432511 Sep 24 '13 at 16:07