
Consider the following setup of HTML Purifier:

require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('Core.EscapeInvalidTags', true);
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);

If you run the following case:

$dirty_html = "<p>lorem <script>ipsum</script></p>";

//output
<p>lorem &lt;script&gt;ipsum&lt;/script&gt;</p>

As expected, instead of removing the invalid tags, it just escaped them all.

However, consider these other test cases:

case 1

$dirty_html = "<p>lorem <b>ipsum</p>";

//output
<p>lorem <b>ipsum</b></p>

//desired output
<p>lorem &lt;b&gt;ipsum</p>

case 2

$dirty_html = "<p>lorem ipsum</b></p>";

//output
<p>lorem ipsum</p>

//desired output
<p>lorem ipsum&lt;/b&gt;</p>

case 3

$dirty_html = "<p>lorem ipsum<script></script></p>";

//output
<p>lorem ipsum&lt;script /&gt;</p>

//desired output
<p>lorem ipsum&lt;script&gt;&lt;/script&gt;</p>

Instead of just escaping the invalid tags, first it repairs them and then escapes them. This way things can get very strange, for example:

case 4

$dirty_html = "<p><a href='...'><div>Text</div></a></p>";

//output
<p><a href="..."></a></p><div><a href="...">Text</a></div><a href="..."></a>&lt;/p&gt;

Question
Therefore, is it possible to disable the syntax repair and just escape the invalid tags?

Mark Messa

1 Answer


The reason you're seeing syntax repair is the fundamental way HTML Purifier approaches HTML sanitization: it first parses the HTML to understand it, then decides which elements to keep in the parsed representation, then renders the HTML.

You might be familiar with one of Stack Overflow's most famous answers, which is an amused and exasperated observation that true regular expressions can't parse HTML: you need additional logic, since HTML is a context-free language, not a regular language. (Modern 'regular' expressions are not formal regular expressions, but that's another matter.) In other words, if you actually want to know what's going on in your HTML, so that you correctly apply your white- or blacklisting, you need to parse it, which means the text ends up in a totally different representation.

An example of how parsing causes changes between input and output is that HTML Purifier strips extraneous whitespace from between attributes. That may not bother you in your case, but it still stems from the fact that the parsed representation of HTML is quite different from the text representation. It's not trying to preserve the form of your input; it's trying to preserve the function.
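To see the parse-then-serialize effect in isolation, here is a minimal sketch that uses PHP's built-in DOMDocument rather than HTML Purifier itself. The helper name `roundTrip` is made up for illustration, and the exact serialization may vary with the libxml version installed.

```php
<?php
// Hypothetical helper: parse an HTML fragment into a tree, then serialize it again.
function roundTrip(string $html): string
{
    // Collect parser warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // These flags ask libxml not to add a doctype or <html>/<body> wrappers.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialization works from the tree, not from the original text.
    return trim($doc->saveHTML());
}

// The unclosed <b> from case 1 of the question.
echo roundTrip('<p>lorem <b>ipsum</p>'), "\n";
```

Once the input has become a tree, the information that `<b>` was never closed in the source text is no longer there to be escaped, which is the effect described above.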

This gets tricky when there is no clear function and it has to start guessing. To pick an example, imagine that while going through the HTML input you come across what looks like an opening <td> tag in the middle of nowhere. You can consider it valid if there was an unclosed <td> tag a while back, as long as you add a closing tag. But if you had escaped the first tag as &lt;td&gt;, you would need to discard the text data that would have been in the <td>, since, depending on browser rendering, it may put data into parts of the page visually outside the fragment, i.e. places that are not clearly user-submitted.

In brief: You can't easily disable all syntax repair and/or tidying without having to rummage through the parsing guts of HTML Purifier and ensuring no information you find valuable is lost.

That said, you can try switching the underlying parsing engine with Core.LexerImpl and see if it gets you better results! :) DOMLex definitely adds missing ending nodes right from the get-go, but from a cursory glance, DirectLex may not. There is also a large chunk of autoclosing logic in HTML Purifier's MakeWellFormed strategy class, which might pose a problem for you.
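As a starting point for that experiment, the setup from the question could be extended roughly like this. It is only a sketch: it reuses the question's include path and sample input, and whether DirectLex actually changes the output for the cases above is untested here, since MakeWellFormed still runs afterwards.

```php
<?php
require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('Core.EscapeInvalidTags', true);
// Swap the default lexer for the pure-PHP DirectLex implementation.
$config->set('Core.LexerImpl', 'DirectLex');

$purifier = new HTMLPurifier($config);

// Case 1 from the question; compare the result against the default lexer.
$dirty_html = "<p>lorem <b>ipsum</p>";
echo $purifier->purify($dirty_html), "\n";
```

Running the same inputs once with and once without the `Core.LexerImpl` line makes it easy to see which repairs come from the lexer and which come from later stages.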

Depending on why you want to preserve this data, though (to allow analysis?), saving the original input separately (while leaving HTML Purifier itself be) may provide you with a better solution.

pinkgothic
  • _"Depending on why you want to preserve this data"_ I'm considering using [Parsedown](http://parsedown.org/) in conjunction with HTML Purifier, as recommended by Parsedown developers. It seems that one limitation would be: what if user tries to enter the markdown `solve the inequations z<x and y>w` ? .... – Mark Messa Jan 22 '18 at 15:00
  • If Parsedown is not set to escape all HTML markup, it would generate the following HTML code: `Consider the inequations z<x and y>w`. However, when passing this into HTML Purifier things get messy: `Consider the inequations z<x and="" y="">w</x>` – Mark Messa Jan 22 '18 at 15:03
  • Of course, if Parsedown is set to escape all HTML markup, then this problem would not arise. It would generate the HTML code: `solve the inequations z<x and y>w` and HTML Purifier would maintain that. – Mark Messa Jan 22 '18 at 15:06
  • Yeah, you should render full HTML out of Parsedown if possible, then throw the result through HTML Purifier (assuming you still want to sanitise the result). – pinkgothic Jan 24 '18 at 14:31
  • (See http://parsedown.org/demo?set%5BMarkupEscaped%5D=1 to play around with that.) // Edit: Oh, wait, I see from your follow-up question you already played around with this. Nevermind! :) – pinkgothic Jan 24 '18 at 14:31
  • _"I guess this is what setting MarkupEscaped=1 does."_ Yes, I have already tried that. The problem with `MarkupEscaped=true` is that you lose the freedom to include some important HTML tags (ex: `` and ``). Of course, I could live with that. But my guess is that the developers have somehow already solved this issue. Maybe whitelists + blacklists ... I'm not sure. – Mark Messa Jan 24 '18 at 15:53
  • _"I see from your follow-up question"_ Yeap, probably I'll receive another [Tumbleweed Badge](https://stackoverflow.com/help/badges/63/tumbleweed) for this one! :’( – Mark Messa Jan 24 '18 at 16:05
  • I don't have any idea about that one yet, alas. I feel like if you allow something to generate invalid HTML (and Parsedown will do that, and I don't blame it), I'm afraid you need to live with the fact that it's invalid HTML, with all the ambiguities and side-effects that means for further parsing. But this is a non-answer, of course, and not even slightly helpful. :( Good luck! – pinkgothic Jan 24 '18 at 16:20