find-and-replace-in-html regular expression fails

Question

I have a regular expression that searches HTML content for some keywords. It used to work, but now it fails and I don't understand why. (The regular expression came from this thread.)

$find = '/(?![^<]+>)(?<!\w)(' . preg_quote($t['label']) . ')\b/s';
$text = preg_replace_callback($find, 'replaceCallback', $text);

function replaceCallback($match) {
    if (is_array($match)) {
        $htmlVersion = $match[1];
        $urlVersion = urlencode($htmlVersion);
        return '<a class="tag" rel="tag-definition" title="Click to know more about ' . $htmlVersion . '" href="?tag=' . $urlVersion . '">' . $htmlVersion . '</a>';
    }
    return $match;
}

The error message points to the preg_replace_callback() call and says:

Warning: preg_replace_callback() [function.preg-replace-callback]: Unknown modifier 't' in /frontend.functions.php  on line 43

HTML is not a regular language so regular expressions may not be the best tool here. — Mark Byers, Jun 29 '10 at 09:00
You shouldn't use regular expressions to parse html. See here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Oded, Jun 29 '10 at 09:00

score 0 · Accepted Answer · answered Jun 29 '10 at 09:31

Please note: this is not an attempt to provide a fix for the regex. It is just here to show how difficult it is (dare I say impossible) to create a regex that will successfully parse HTML. Even well structured XHTML would be nightmarishly difficult, but poorly structured HTML is a no-go for regular expressions.

I agree 100% that using regular expressions to attempt HTML parsing is a very bad idea. The following code uses the supplied function on some simple HTML. It trips up on its second attempt, when the label itself contains an HTML tag, <em>Test</em>:

$t['label'] = 'Test';
$text = '<p>Test</p>';

$find = '/(?![^<]+>)(?<!\w)(' . preg_quote($t['label']) . ')\b/s';
$text = preg_replace_callback($find, 'replaceCallback', $text);

echo "Find:   $find\n";
echo 'Quote:  ' . preg_quote($t['label']) . "\n";
echo "Result: $text\n";

/* Returns:

Find:   /(?![^<]+>)(?<!\w)(Test)\b/s
Quote:  Test
Result: <p><a class="tag" rel="tag-definition" title="Click to know more about Test" href="?tag=Test">Test</a></p>

*/

$t['label'] = '<em>Test</em>';
$text = '<p>Test</p>';

$find = '/(?![^<]+>)(?<!\w)(' . preg_quote($t['label']) . ')\b/s';
$text = preg_replace_callback($find, 'replaceCallback', $text);

echo "Find:   $find\n";
echo 'Quote:  ' . preg_quote($t['label']) . "\n";
echo "Result: $text\n";

/* Returns:

Find:   /(?![^<]+>)(?<!\w)(Test)\b/s
Quote:  Test
Result: <p><a class="tag" rel="tag-definition" title="Click to know more about Test" href="?tag=Test">Test</a></p>
Warning: preg_replace_callback() [function.preg-replace-callback]: Unknown modifier '\' in /test.php  on line 25
Find:   /(?![^<]+>)(?<!\w)(\<em\>Test\</em\>)\b/s
Quote:  \<em\>Test\</em\>

Result: 

*/

function replaceCallback($match) {
    if (is_array($match)) {
        $htmlVersion = $match[1];
        $urlVersion = urlencode($htmlVersion);
        return '<a class="tag" rel="tag-definition" title="Click to know more about ' . $htmlVersion . '" href="?tag=' . $urlVersion . '">' . $htmlVersion . '</a>';
    }
    return $match;
}
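
Separately from the wider point about parsing HTML, the warnings themselves seem to have a narrow cause: preg_quote() does not escape the pattern delimiter unless it is passed as the second argument, so a "/" in the label closes the pattern early and whatever follows is read as modifiers. A minimal sketch, assuming "/" stays the delimiter; buildFindPattern and the 'client/server' label are hypothetical names used only for illustration:

```php
<?php
// Hypothetical helper: builds the search pattern from the question, but tells
// preg_quote() which delimiter is in use so that "/" inside the label is escaped.
function buildFindPattern($label)
{
    return '/(?![^<]+>)(?<!\w)(' . preg_quote($label, '/') . ')\b/s';
}

$label = 'client/server';            // illustrative label containing the delimiter

echo preg_quote($label), "\n";       // the "/" is left unescaped
echo preg_quote($label, '/'), "\n";  // the "/" is escaped as "\/"

// The pattern now compiles: preg_match() returns 0 or 1 rather than false.
var_dump(preg_match(buildFindPattern($label), 'a client/server setup'));
```

This only removes the warning. It does not make the pattern any more reliable on real HTML.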

OK, I think I've got it: HTML is not regular enough for regular expressions :) But then, how would you go about replacing words with hyperlinks in HTML content? — pixeline, Jun 29 '10 at 09:43
@pixeline: :-) Sorry to hammer it in - it's just a question that comes up a lot all over the place. Regexes can seem like a good idea at first, but rarely work. Anyway, you should probably try the [DOM functions](http://www.php.net/manual/en/book.dom.php) in PHP. The [PHPro Parse HTML With PHP And DOM](http://www.phpro.org/examples/Parse-HTML-With-PHP-And-DOM.html) tutorial may help too. — Mike, Jun 29 '10 at 09:57
@pixeline Questions like yours come up at least three times a day. Search for *replace attributes in HTML* or similar keywords or just browse the questions a few pages back. The key lib you want is DOM. — Gordon, Jun 29 '10 at 12:37
It's perfect, except that if a found keyword is already inside an A tag, it creates an A tag inside the A tag... The more I've tried, the more I think that this function works really well for HTML fragments. DOM is pretty bad at searching for keywords. — pixeline, Jun 30 '10 at 15:43
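
The DOM approach suggested in the comments, including the nested-link problem raised in the last one, can be sketched roughly as follows. This is only an outline under some assumptions: linkKeyword is a hypothetical helper name, the input is an HTML fragment, character-encoding handling is left out (so it is only exercised here with ASCII text), and text inside script or style elements is not treated specially. The regular expression is applied to individual text nodes only, never to markup.

```php
<?php
// Hypothetical helper: wraps whole-word occurrences of $label found in text
// nodes with a tag link, skipping text that is already inside an <a> element.
function linkKeyword($html, $label)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    // Wrap the fragment in a known root element so it can be unwrapped later.
    $doc->loadHTML('<div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="root"]')->item(0);
    $pattern = '/(?<!\w)(' . preg_quote($label, '/') . ')\b/';

    // Copy the node list first, because the tree is modified inside the loop.
    $textNodes = iterator_to_array($xpath->query('.//text()[not(ancestor::a)]', $root));
    foreach ($textNodes as $node) {
        // Odd indexes of $parts hold the matched keyword, even ones the text between.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;                   // keyword not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('class', 'tag');
                $a->setAttribute('rel', 'tag-definition');
                $a->setAttribute('title', 'Click to know more about ' . $part);
                $a->setAttribute('href', '?tag=' . urlencode($part));
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkKeyword('<p>Test and <a href="#">Test</a></p>', 'Test'), "\n";
```

Because the XPath query excludes text nodes that have an a ancestor, the second "Test" in the example is left alone, which avoids creating a link inside a link.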
