Matching nth html paragraph not containing an image with PHP regex

Question

I'm trying to use a regex to insert content after the nth HTML paragraph that doesn't contain an image. So far I haven't been able to properly exclude paragraphs containing images.

What am I missing, or is this outside the effective use of regex?

My code so far:

$content = '
<p><a href="#"> <img align="right" src="blah.jpg"> </a> Some paragraph text</p>

<param name="blah" value="blah"> <!-- to make sure we are only counting <p>s -->
<param name="blah" value="blah">
<param name="blah" value="blah">

<p>First paragraph to count.</p>
<p>Second paragraph to count.</p>

<p>Blah blah <a href="#">link</a><img src="blah.jpg" /> blah </p>

<p>Third paragraph to count.</p>
<p>Fourth paragraph to count.</p>
';

$insert = "\n\n".'<h3>INSERT ME</h3>'."\n\n";

$pattern = '/((?:.*?<p[\W.]*?>(?!<img)){3})(.*$)/is';

preg_match($pattern, $content, $matches);

if (!empty($matches)) {
    echo "Yes!\n";
    echo $matches[1].$insert.$matches[2];
}else{
    echo "No.\n";
    echo $content;
    echo $insert;
}

Thanks!

score 3 · Accepted Answer · edited May 23 '17 at 11:47

Once you have had enough pain with the regex fiddling, try DOM as an alternative:

$dom = new DOMDocument;
// Load and parse the page; the URL is a placeholder
$dom->loadHTMLFile('http://example.com/foo.htm');
$xPath = new DOMXPath($dom);
// Select the third <p> that has no <img> descendant
foreach ($xPath->query('/html/body//p[not(descendant::img)][3]') as $p) {
    $h3 = $dom->createElement('h3', "Regex can't parse HTML");
    // Insert the new heading directly after the matched paragraph
    if ($p->nextSibling !== null) {
        $p->parentNode->insertBefore($h3, $p->nextSibling);
    } else {
        $p->parentNode->appendChild($h3);
    }
}
echo $dom->saveHTML();

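Fetching the third paragraph in the HTML body that has no img element anywhere below it is easily done with XPath:

/html/body//p[not(descendant::img)][3]

Note that in this form the [3] is evaluated per parent element, so it selects the third such paragraph among siblings. That is enough when all paragraphs share one parent, as in the question's markup. To count in document order across different parents, wrap the path in parentheses:

(/html/body//p[not(descendant::img)])[3]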
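As a rough sketch of how this applies to the markup in the question, which is a string rather than a URL, the following uses loadHTML() instead of loadHTMLFile(). The function name insertHeadingAfterNthTextParagraph and its parameters are made up for illustration. The XPath is wrapped in parentheses so that the index counts paragraphs in document order rather than per parent element.

```php
<?php
// Hypothetical helper (not from the answer): adds an <h3> after the nth <p>,
// in document order, that has no <img> descendant.
function insertHeadingAfterNthTextParagraph(string $html, int $n, string $headingText): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xPath = new DOMXPath($dom);
    // Parentheses make [n] index the whole result set, not siblings per parent.
    $query = sprintf('(//p[not(descendant::img)])[%d]', $n);
    foreach ($xPath->query($query) as $p) {
        $h3 = $dom->createElement('h3', $headingText);
        // A null reference node makes insertBefore() append at the end.
        $p->parentNode->insertBefore($h3, $p->nextSibling);
    }
    return $dom->saveHTML();
}

// Sample input: the markup from the question.
$content = '
<p><a href="#"> <img align="right" src="blah.jpg"> </a> Some paragraph text</p>

<param name="blah" value="blah"> <!-- to make sure we are only counting <p>s -->
<param name="blah" value="blah">
<param name="blah" value="blah">

<p>First paragraph to count.</p>
<p>Second paragraph to count.</p>

<p>Blah blah <a href="#">link</a><img src="blah.jpg" /> blah </p>

<p>Third paragraph to count.</p>
<p>Fourth paragraph to count.</p>
';

$result = insertHeadingAfterNthTextParagraph($content, 3, 'INSERT ME');
```

Printing $result with echo shows the heading directly after "Third paragraph to count." Because loadHTML() parses a fragment as a whole document, the output is wrapped in html and body elements; whether that is acceptable depends on where the markup ends up.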

Also see "Best methods to parse HTML" and my other answers on DOM.

score 0 · Answer 2 · answered Nov 02 '10 at 05:52

This is pretty far outside the normal use of regexes. Although it might be possible, it's far easier, far more maintainable, and possibly computationally faster to split this problem up into subproblems.

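First of all, a regex cannot reliably handle arbitrarily nested markup, and HTML comments, which are valid HTML and may themselves contain text that looks like a <p> or <img> tag, will throw the match off.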

Consider first splitting the content up into an array of paragraphs, looping through the paragraphs to find the third paragraph that doesn't contain an image, and inserting your text after that.
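The split-and-loop idea above could look roughly like this. It is only a sketch: the function name and parameters are hypothetical, it assumes paragraphs are not nested and that every <p> has a closing </p>, and it treats any occurrence of "<img" inside a paragraph as an image. Comments are matched as their own alternative so that a <p> inside a comment is not counted.

```php
<?php
// Hypothetical helper: inserts $insert after the nth <p>...</p> block that
// contains no <img. Assumes well-formed, non-nested paragraphs.
function insertAfterNthParagraphWithoutImage(string $html, int $n, string $insert): string
{
    // Find comments and <p>...</p> blocks together with their byte offsets.
    // \b keeps tags such as <param> from being mistaken for <p>.
    preg_match_all('~<!--.*?-->|<p\b[^>]*>.*?</p\s*>~is', $html, $matches, PREG_OFFSET_CAPTURE);

    $seen = 0;
    foreach ($matches[0] as [$block, $offset]) {
        if (strpos($block, '<!--') === 0) {
            continue; // a comment, not a paragraph
        }
        if (stripos($block, '<img') !== false) {
            continue; // paragraph contains an image, so it is not counted
        }
        if (++$seen === $n) {
            $end = $offset + strlen($block);
            return substr($html, 0, $end) . $insert . substr($html, $end);
        }
    }
    // Fewer than $n matching paragraphs: append, like the question's fallback.
    return $html . $insert;
}

// Sample input: the markup and insert string from the question.
$content = '
<p><a href="#"> <img align="right" src="blah.jpg"> </a> Some paragraph text</p>

<param name="blah" value="blah"> <!-- to make sure we are only counting <p>s -->
<param name="blah" value="blah">
<param name="blah" value="blah">

<p>First paragraph to count.</p>
<p>Second paragraph to count.</p>

<p>Blah blah <a href="#">link</a><img src="blah.jpg" /> blah </p>

<p>Third paragraph to count.</p>
<p>Fourth paragraph to count.</p>
';
$insert = "\n\n" . '<h3>INSERT ME</h3>' . "\n\n";

$result = insertAfterNthParagraphWithoutImage($content, 3, $insert);
```

Working with offsets rather than rebuilding the string from the pieces leaves everything outside the matched paragraph, including the comment and the <param> tags, untouched.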

If you really have to use regexes, something like

^.*?((((<p(>|\W)(?!<img(>|\W))))(.(?!<img(>|\W)))*?(</p\W*>).*?)){3}

would match what you want once you strip comments.

Explanation:

The ^ ensures that the pattern only matches up to the first three paragraphs that fit, and not the last three (or every three). Then, it reluctantly matches anything up until the real pattern starts. The real pattern then matches a <p> tag (avoiding other tags that start with p, but still allowing for attributes) so long as it is not immediately followed by an <img> tag, and then reluctantly matches any character so long as it is not followed by an <img> tag until it gets to a </p> close tag. This means that since no character between <p> and </p> has an <img> tag following it, there is no <img> tag between <p> and </p>. After that, the pattern reluctantly matches any other characters so that it allows for anything between non-image paragraphs, but so that it doesn't match non-image paragraphs themselves, and so that it doesn't match anything that's not strictly needed. This is then repeated three times, to get the third such paragraph.
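For completeness, here is one way the pattern might be applied in PHP. The answer gives only the bare pattern, so the ~ delimiters and the i and s modifiers are assumptions added here: s is needed so that . can cross the newlines in the question's markup. Comments are stripped first, as the answer requires, which also means they are missing from the output.

```php
<?php
// Sample input: the markup and insert string from the question.
$content = '
<p><a href="#"> <img align="right" src="blah.jpg"> </a> Some paragraph text</p>

<param name="blah" value="blah"> <!-- to make sure we are only counting <p>s -->
<param name="blah" value="blah">
<param name="blah" value="blah">

<p>First paragraph to count.</p>
<p>Second paragraph to count.</p>

<p>Blah blah <a href="#">link</a><img src="blah.jpg" /> blah </p>

<p>Third paragraph to count.</p>
<p>Fourth paragraph to count.</p>
';
$insert = "\n\n" . '<h3>INSERT ME</h3>' . "\n\n";

// The pattern from the answer, unchanged. Delimiters and modifiers are
// assumptions: i = case-insensitive tags, s = dot also matches newlines.
$pattern = '~^.*?((((<p(>|\W)(?!<img(>|\W))))(.(?!<img(>|\W)))*?(</p\W*>).*?)){3}~is';

// Strip comments first so a <p> inside a comment cannot be counted.
$stripped = preg_replace('~<!--.*?-->~s', '', $content);

// The whole match ends right after the third image-free paragraph,
// so appending $insert to it places the new content there.
$result = preg_replace_callback(
    $pattern,
    function (array $m) use ($insert) {
        return $m[0] . $insert;
    },
    $stripped,
    1
);
```

On this sample the heading lands after "Third paragraph to count." The nested lazy quantifiers backtrack heavily, so on large documents it would be worth checking preg_last_error() after the call.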
