html_entity_decode in specific regular expression for a preg_replace

Question

I have these preg_replace patterns and replacements:

$patterns = array(
    "/<br\W*?\/>/",
    "/<strong>/",
    "/<*\/strong>/",
    "/<h1>/",
    "/<*\/h1>/",
    "/<h2>/",
    "/<*\/h2>/",
    "/<em>/",
    "/<*\/em>/",
    '/(?:\<code*\>([^\<]*)\<\/code\>)/',
);
$replacements = array(
    "\n",
    "[b]",
    "[/b]",
    "[h1]",
    "[/h1]",
    "[h2]",
    "[/h2]",
    "[i]",
    "[/i]",
    '[code]***HTML DECODE HERE***[/code]',
);

In my string I want to apply html_entity_decode to the content between these tags: <code> &lt; &gt; </code>, but keep my array structure for preg_replace.

So this: <code> &lt; &gt; </code> should become this: [code] < > [/code]

Any help would be much appreciated, thanks!

score 1 · Accepted Answer · edited May 23 '17 at 10:25

You cannot encode this in the replacement string. As PoloRM suggested, you could use preg_replace_callback specifically for your last replacement instead:

function decode_html($matches)
{
    return '[code]'.html_entity_decode($matches[1]).'[/code]';
}

$str = '<code> &lt; &gt; </code>';
$str = preg_replace_callback('/(?:\<code*\>([^\<]*)\<\/code\>)/', 'decode_html', $str);

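Equivalently, using create_function (note that create_function was deprecated in PHP 7.2.0 and removed in PHP 8.0.0, so this variant only works on older versions):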

$str = preg_replace_callback(
    '/(?:\<code*\>([^\<]*)\<\/code\>)/',
    create_function(
       '$matches',
        'return \'[code]\'.html_entity_decode($matches[1]).\'[/code]\';'
    ),
    $str
);

Or, as of PHP 5.3.0, with an anonymous function:

$str = preg_replace_callback(
    '/(?:\<code*\>([^\<]*)\<\/code\>)/',
    function ($matches) {
        return '[code]'.html_entity_decode($matches[1]).'[/code]';
    },
    $str
);

But note that in all three cases, your pattern is not really optimal. Firstly, you don't need to escape < and > (dropping the backslashes only improves readability). Secondly, the * in code* applies to the letter e, so it allows any number of e's (including none), which means the pattern also matches <cod> or <codeee>. I suppose you wanted to allow attributes. Thirdly, you cannot include other tags within your <code> element, because [^<]* stops at the first <. In this case you should probably use ungreedy repetition instead (I also changed the delimiter to ~ so that / needs no escaping):

~(?:<code[^>]*>(.*?)</code>)~
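As a rough sketch of how this pattern might slot into the original setup: the static replacements stay in the two parallel arrays (shortened here), and only the <code> entry moves into a callback. The sample string, the variable name $codePattern and the trailing s modifier are assumptions added for illustration and are not part of the answer above. Without s, the dot does not match newlines, so multi-line <code> blocks would be skipped.

```php
<?php
// Static tag-to-BBCode pairs stay in the parallel arrays from the question
// (shortened here); the <code> entry is taken out and handled separately.
$patterns = array('/<br\W*?\/>/', '/<strong>/', '/<\/strong>/');
$replacements = array("\n", '[b]', '[/b]');

// Improved pattern from the answer. The trailing "s" modifier is an addition
// for this sketch: it lets "." match newlines inside <code>...</code>.
$codePattern = '~<code[^>]*>(.*?)</code>~s';

// Hypothetical input, made up for this example.
$str = '<strong>Compare</strong><br /><code class="php">a &lt; b &amp;&amp; b &gt; c</code>';

// Step 1: plain string replacements.
$str = preg_replace($patterns, $replacements, $str);

// Step 2: decode entities only inside <code>...</code>.
$str = preg_replace_callback(
    $codePattern,
    function ($matches) {
        return '[code]' . html_entity_decode($matches[1]) . '[/code]';
    },
    $str
);

echo $str;
```

The static replacements run first on purpose: if the decoding ran first, text such as &lt;strong&gt; inside a code block would turn into a real-looking <strong> tag and then be rewritten to [b] by the second pass.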

As you can already see, this is still far from perfect (in terms of correctly matching the HTML in the first place). Hence, the obligatory reminder: don't use regex to parse HTML. You will be much better off using a DOM parser. PHP ships with a built-in one, and there is also this very convenient-to-use third-party one.
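For comparison, here is a minimal sketch of the DOM route using PHP's bundled DOM extension (DOMDocument). It assumes that extension is enabled, and the input string and the $blocks variable are made up for illustration. The point it shows is that the parser decodes entities itself, so textContent already holds the decoded text and html_entity_decode is not needed.

```php
<?php
// Hypothetical input, made up for this example.
$html = '<p>Compare: <code class="php">a &lt; b &amp;&amp; b &gt; c</code></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Collect the decoded content of every <code> element as BBCode.
$blocks = array();
foreach ($doc->getElementsByTagName('code') as $code) {
    // textContent is already entity-decoded by the parser.
    $blocks[] = '[code]' . $code->textContent . '[/code]';
}

print_r($blocks);
```

This only extracts the code blocks. Turning the whole document into BBCode would mean walking the tree and emitting output per node, which is more work than the regex version but copes with attributes and nested tags.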

Thanks for the answer, I think I'll consider the DOM parser, but it's a bit more complicated :p — user990463, Oct 29 '12 at 09:19
@user990463, especially with the second one I linked it is really not all that complicated. It's very easy to use (just go to their documentation and check out some of the examples). — Martin Ender, Oct 29 '12 at 09:20
Yes, I would like to use that one, but for technical reasons (outside my control) I can't install 3rd-party extensions :( — user990463, Oct 29 '12 at 09:26
@user990463 ah I see. That does make it more laborious, but it is definitely necessary if you want to create a robust application (just think about HTML tags within attribute-strings or HTML comments ... any regex solution will badly choke on those; not even speaking of invalid HTML which can usually be partially handled by DOM parsers) — Martin Ender, Oct 29 '12 at 09:29
Yes, I agree with that, regex is not really suitable for complex HTML replacement though. So I'll get into PHP DOM parsing ;) Thanks for the advice bro! — user990463, Oct 29 '12 at 09:32

score 0 · Answer 2 · answered Oct 29 '12 at 08:06


Check out this:

http://www.php.net/manual/en/function.preg-replace-callback.php

You can create a callback function that applies the html_entity_decode functionality on your match.
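If keeping everything in one array matters, one option not mentioned in this thread is preg_replace_callback_array, available from PHP 7.0. The sketch below is an assumption about how it could be wired up: the $static map, the loop that wraps each fixed replacement in a closure, and the sample input are all made up for illustration.

```php
<?php
// Fixed replacements, written as pattern => replacement (shortened list).
$static = array(
    '/<br\W*?\/>/' => "\n",
    '/<strong>/'   => '[b]',
    '/<\/strong>/' => '[/b]',
);

// preg_replace_callback_array expects pattern => callable, so each fixed
// replacement is wrapped in a closure that simply returns it.
$callbacks = array();
foreach ($static as $pattern => $replacement) {
    $callbacks[$pattern] = function ($matches) use ($replacement) {
        return $replacement;
    };
}

// The <code> entry is the only one that needs real logic.
$callbacks['/<code>([^<]*)<\/code>/'] = function ($matches) {
    return '[code]' . html_entity_decode($matches[1]) . '[/code]';
};

// Hypothetical input, made up for this example.
$result = preg_replace_callback_array($callbacks, '<strong>x</strong><code>&amp;</code>');

echo $result;
```

The patterns are applied in array order, so the <code> entry is added last to keep decoded text from being re-matched by the tag patterns.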

PoloRM
