I've installed a syntax highlighter, but for it to work, angle brackets in the code must be written as &lt; and &gt;. I need to replace every < with &lt; and every > with &gt;, but only inside the PRE tag.

In short, I want to escape all HTML special characters inside the pre tag.

Thanks in advance.

  • Not sure I understand - are you trying to escape HTML code to display it on your page? – benedict_w Mar 31 '12 at 11:26
  • Yes, but only inside of the 'pre' tag. –  Mar 31 '12 at 11:33
  • Use `htmlspecialchars` on the tag's contents before you `echo` them. That's what you should be doing on *everything* before you echo it as well. – Jon Mar 31 '12 at 11:43
  • @Jon But how would I use it inside the pre tag only? –  Mar 31 '12 at 11:45
  • If you've used an MVC pattern then in your code you should know exactly where it outputs tags in the view and be able to add the `htmlspecialchars` quite simply – Sam Giles Mar 31 '12 at 11:58
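
As a minimal sketch of the `htmlspecialchars` suggestion in the comments above, assuming the contents of the tag are already available as a separate string (`$code` and `$escaped` are hypothetical variable names, not part of the question):

```php
<?php
// Hypothetical input: markup that should be shown literally inside <pre>.
$code = '<em>interesting</em>';

// htmlspecialchars() converts the HTML special characters (such as &, < and >)
// into entities, so the browser displays them instead of interpreting them.
$escaped = htmlspecialchars($code);

echo "<pre>$escaped</pre>";
```

This only covers the case where the code is kept apart from the surrounding HTML. When the `<pre>` tags are embedded in a larger HTML string, the string has to be parsed first.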

1 Answer

tl;dr

You need to parse the input HTML. Use the DOMDocument class to represent your document, parse the input, find all <pre> elements (using getElementsByTagName) and escape their contents.

Code

Unfortunately, the DOM API is quite low-level and forces you to iterate over the child nodes of each <pre> element yourself in order to escape them. This looks as follows:

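// Rebuilds the markup of $node as a string so that the caller can escape it.
function escapeRecursively($node) {
    // Text nodes contribute their (decoded) text as-is.
    if ($node instanceof DOMText)
        return $node->textContent;

    // Comments and similar nodes have no tag name; let the document serialize them.
    if (!($node instanceof DOMElement))
        return $node->ownerDocument->saveXML($node);

    // Opening tag, including attributes (otherwise they would be lost).
    $content = "<$node->nodeName";
    foreach ($node->attributes as $attribute) {
        $content .= ' ' . $attribute->name . '="' . htmlspecialchars($attribute->value) . '"';
    }
    $content .= '>';

    foreach ($node->childNodes as $child) {
        $content .= escapeRecursively($child);
    }

    return "$content</$node->nodeName>";
}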

Now this function can be used to escape every <pre> node in the document:

function escapePreformattedCode($html) {
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $pres = $doc->getElementsByTagName('pre');
    for ($i = 0; $i < $pres->length; $i += 1) {
        $node = $pres->item($i);

        $children = $node->childNodes;
        $content = '';
        for ($j = 0; $j < $children->length; $j += 1) {
            $child = $children->item($j);
            $content .= escapeRecursively($child);
        }
        $node->nodeValue = htmlspecialchars($content);
    }

    return $doc->saveHTML();
}

Test

$string = '<h1>Test</h1> <pre>Some <em>interesting</em> text</pre>';
echo escapePreformattedCode($string);

Yields:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><h1>Test</h1> <pre>Some &lt;em&gt;interesting&lt;/em&gt; text</pre></body></html>

Note that a DOM always represents a complete document. Hence when the DOM parser gets a document fragment it fills in the missing information. This makes the output potentially different from the input.
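
If the added doctype and <html>/<body> wrapper are unwanted, one possible workaround is sketched below. It is not part of the code above: `loadFragment` and `saveFragment` are hypothetical helper names, and the sketch assumes a PHP and libxml version that provide the `LIBXML_HTML_NOIMPLIED` and `LIBXML_HTML_NODEFDTD` constants (PHP 5.4 or later, as far as I know). Because the parser then treats the first top-level element as the document root, the fragment is wrapped in a temporary <div>, and only the children of that <div> are serialized again.

```php
<?php
// Hypothetical helper: parse an HTML fragment without an implied
// doctype or <html>/<body> elements.
function loadFragment(string $fragment): DOMDocument {
    $doc = new DOMDocument();
    // The temporary <div> gives the fragment a single root element.
    $doc->loadHTML(
        "<div>$fragment</div>",
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    return $doc;
}

// Hypothetical helper: serialize everything inside the temporary <div>.
function saveFragment(DOMDocument $doc): string {
    $html = '';
    foreach ($doc->documentElement->childNodes as $child) {
        // saveHTML() with a node argument serializes just that node.
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

$doc = loadFragment('<h1>Test</h1><pre>Some text</pre>');
echo saveFragment($doc);
```

The <pre> escaping shown earlier could be applied to the document between these two calls. Whitespace and line breaks in the result may still differ slightly from the input.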

Konrad Rudolph
  • Thanks! Works fine apart from one small bug. This is displayed when I load the page: /www.w3.org/TR/REC-html40/loose.dtd"> –  Mar 31 '12 at 13:56
  • @Terry What page are you loading then? My example code displays just fine in a browser. – Konrad Rudolph Mar 31 '12 at 14:10
  • The view post page. I tested it out, it only occurs when I add your code. –  Mar 31 '12 at 14:13
  • @Terry Nevertheless, I’m fairly confident that the error is elsewhere and was just flushed out by adding my code. It works in isolation. I can’t really say any more since I don’t know your code. – Konrad Rudolph Mar 31 '12 at 14:15