Issue: When using HTML Purifier to process user-inputted content, line-breaks are not being translated into <br /> tags.

Consider the following user-inputted content:

Lorem ipsum dolor sit amet.
This is another line.

<pre>
.my-css-class {
    color: blue;
}
</pre>

Lorem ipsum:

<ul>
<li>Lorem</li>
<li>Ipsum</li>
<li>Dolor</li>
</ul>

Dolor sit amet,
MyName

When processed using HTML Purifier, the above is being altered to the following:

Lorem ipsum dolor sit amet. This is another line.

.my-css-class {
    color: blue;  
} 

Lorem ipsum:

  • Lorem
  • Ipsum
  • Dolor
Dolor sit amet, MyName

As you can see, "MyName", which the user intended to be on a separate line, ends up on the same line as the previous text.

How to fix?

Using the PHP nl2br() function, of course. However, new issues arise whether we use it before or after purifying the content.

Here is an example when using nl2br() before HTML Purifier:

Lorem ipsum dolor sit amet.
This is another line.

.my-css-class {

    color: blue; 

} 

Lorem ipsum:

  • Lorem
  • Ipsum
  • Dolor

Dolor sit amet,
MyName

What happens is that nl2br() adds a <br /> tag for every line-break, so even the line-breaks inside the <pre> block are converted, as are the ones after each <li> tag.
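A quick way to see this behaviour in isolation is the minimal sketch below. The sample string is made up for illustration; only nl2br() itself is used.

```php
<?php
// Made-up sample input: a <pre> block followed by a short list, as typed in a textarea.
$input = "<pre>\ncolor: blue;\n</pre>\n<ul>\n<li>Lorem</li>\n</ul>";

// nl2br() is not HTML-aware: it inserts <br /> in front of every newline it finds,
// including the ones inside <pre> and the ones between list tags.
$output = nl2br($input);

echo $output;
```

The input contains five newlines, so five <br /> tags are inserted, one of them directly after the opening <pre> tag.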

What I tried

I tried a custom nl2br() function which replaces line-breaks with <br /> tags, and then removes all <br /> tags from <pre> blocks. It works great, however the issue remains for the <li> items.

Trying the same approach for <ul> blocks would also remove all <br /> tags from the <li> children, unless we used a more complex regex to remove only the <br /> tags that are inside <ul> elements but outside <li> elements. But then what about a nested <ul> within a <li> item? To handle all those situations we'd need an even more complex regex!

  • If this is the right approach, could you help me out with the regex?
  • If it's not the right approach, how could I solve this problem? I am also open to alternatives to HTML Purifier.

Other resources that I've already looked at:

Community
  • `nl2br` should be used on plaintext when it's being put into an HTML context. In your case, you already have HTML. Why does your HTML not correctly contain `<br>`s already for line breaks?
    – deceze Jul 15 '13 at 08:21
  • @deceze the content comes from a simple textarea, where some HTML tags are allowed. Allowing some HTML tags like ``, ``, `
    ` or `
      ` is the new trend that tends to replace [BBCode](http://en.wikipedia.org/wiki/BBCode), [Markdown](http://en.wikipedia.org/wiki/Markdown), and [Textile](http://en.wikipedia.org/wiki/Textile_(markup_language)) for example.
    – Community Jul 15 '13 at 10:06
  • So if the user is basically writing HTML, he should be writing `<br>` tags too. Maybe he's using line breaks in HTML as they were intended: to make markup more readable without actually introducing line breaks into the text. As far as I'm concerned, you can't have it both ways. :) You'd really need to parse the HTML and apply `nl2br` only on specific text nodes, excluding `<pre>` elements.
    – deceze Jul 15 '13 at 10:12
  • _"So if the user is basically writing HTML, he should be writing `<br>` tags too."_: **That's insane!** Stack Overflow itself accepts both Markdown and some HTML tags like `` and it doesn't require me to manually write `<br>` tags! That's nonsense, your suggestion is totally user-unfriendly. Most people don't use or know HTML, however allowing basic tags lets the people who do know it use them.
    – Community Jul 15 '13 at 10:39
  • "_You'd really need to parse the HTML and apply `nl2br` only on specific text nodes, excluding `<pre>` elements._": That's exactly what I want, any idea how I could achieve this using HTML Purifier (or other)? Your help is greatly appreciated ;)
    – Community Jul 15 '13 at 10:41
    Re "That's insane": Yes, it is. And SO uses *Markdown line breaks*! SO simply does not remove some HTML tags, but it has Markdown for basic text formatting, including line breaks. That's my point: if you require your users to write HTML and HTML only, then that's the tradeoff. – deceze Jul 15 '13 at 10:49
  • You're missing the point my dear, I don't _require_ my users to write HTML, I _allow_ them to write HTML! That's a huge difference. Make it a possibility, not a requirement. – Community Jul 15 '13 at 11:19
  • And *you're* missing the point that you have a Catch-22. :) To convert bare line breaks into `<br>` tags, you need to parse the HTML to do so only on certain elements. But you have to do that before you have sanitized the HTML, which means you probably can't parse the HTML correctly. It's a very tricky proposition. I understand what you're trying to do, but the very reason it is tricky and error prone is the reason Markdown & co. came into existence in the first place. And SO is not a good example for it working, because it doesn't.
    – deceze Jul 15 '13 at 11:42
  • @deceze I understand that it's not an easy task, however I'm seeking technical assistance, not advice about the initial choice of _allowing_ (not _imposing_) some HTML tags. If you're interested in that argument, please see this question about the right markup to use: http://stackoverflow.com/questions/342961/what-markup-language-for-richly-formated-content Cheers! – Community Jul 15 '13 at 11:50
    Well, to summarize my ranting into the direction of help: you need to sanitize your HTML first, which is tricky. HTML Purifier seems to be one of the few if not *the* only library that purportedly gets this right. After that you should use a DOM processor to go through the HTML and apply `nl2br`. If Purifier by default messes up line breaks inherent in the input so you cannot do the second step afterwards, you need to customize Purifier to behave differently and/or roll the `nl2br` right into its processing. Have you investigated that possibility? I can't give you a solution in code ATM. – deceze Jul 15 '13 at 11:57
  • It's not quite what you're asking, but this might be able to help you along in spirit, at least: http://htmlpurifier.org/live/configdoc/plain.html#AutoFormat.AutoParagraph -- see if that works in any way like you expect it to. (Also, I recommend listening to deceze, who is genuinely trying to save you from a massive headache.) – pinkgothic Jul 17 '13 at 20:59
  • @pinkgothic thanks a lot for your input. Unfortunately I have already had a look at AutoParagraph, what it does is only wrapping text blocks separated by 2 consecutive line-breaks with `<p>` tags. All the single line-breaks are left unprocessed. I actually don't care about `<p>` tags, I just want all the line-breaks the user intended to add to be left intact (i.e. converted to `<br>` tags where appropriate).
    – Community Jul 18 '13 at 06:03
  • _"Also, I recommend listening to deceze, who is genuinely trying to save you from a massive headache."_ Looking for a solution to keep the user content intact while still allowing some HTML tags is a genuine goal that makes a lot of sense when you're looking from the user-experience point-of-view, and you should all be more sensitive about it. That being said, I'm looking for technical assistance on how to achieve such thing, I don't need discouragements. – Community Jul 18 '13 at 06:14
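
For reference, the approach deceze describes in the comments (sanitize first, then walk the DOM and convert line-breaks only in selected text nodes) could look roughly like the sketch below. It is not from the thread: the function name nl2br_dom and the wrapper id nl2br-root are made up for illustration, it assumes PHP's DOM extension is available and that the input has already been purified, and it makes no attempt to trim the extra <br> tags that end up around block elements.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): newline-to-<br> on text nodes only.
function nl2br_dom($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep libxml parse warnings quiet
    // The XML prolog is a common trick to make loadHTML() treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="nl2br-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="nl2br-root"]')->item(0);

    // Skip text inside <pre>, and whitespace sitting directly inside <ul>/<ol>.
    $query = './/text()[not(ancestor::pre) and not(parent::ul) and not(parent::ol)]';

    // Copy the node list first, because the loop below modifies the tree.
    foreach (iterator_to_array($xpath->query($query, $root)) as $text) {
        $parts = preg_split('/\r\n|\r|\n/', $text->nodeValue);
        if (count($parts) < 2) {
            continue; // no line-break in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i > 0) {
                $fragment->appendChild($doc->createElement('br'));
            }
            if ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo nl2br_dom("Dolor sit amet,\nMyName<pre>a\nb</pre>");
```

Because the XPath expression works on the parsed tree, nested lists need no special regex: text inside an <li> is converted, while whitespace directly between list tags is left alone.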

2 Answers

This issue can be solved partially (if not completely) with a custom nl2br() function:

function nl2br_special($string){

    // Step 1: Add <br /> tags for each line-break
    $string = nl2br($string); 

    // Step 2: Remove the actual line-breaks
    $string = str_replace("\n", "", $string);
    $string = str_replace("\r", "", $string);

    // Step 3: Restore the line-breaks that are inside <pre></pre> tags
    // (Step 2 removed every newline, so "." can span a whole block without the "s" modifier)
    if(preg_match_all('/\<pre\>(.*?)\<\/pre\>/', $string, $match)){
        // $match[1] holds the contents captured between <pre> and </pre>
        foreach($match[1] as $b){
            $string = str_replace('<pre>'.$b.'</pre>', "<pre>".str_replace("<br />", PHP_EOL, $b)."</pre>", $string);
        }
    }

    // Step 4: Removes extra <br /> tags

    // Before <pre> tags
    $string = str_replace("<br /><br /><br /><pre>", '<br /><br /><pre>', $string);
    // After </pre> tags
    $string = str_replace("</pre><br /><br />", '</pre><br />', $string);

    // Around <ul></ul> tags
    $string = str_replace("<br /><br /><ul>", '<br /><ul>', $string);
    $string = str_replace("</ul><br /><br />", '</ul><br />', $string);
    // Inside <ul> </ul> tags
    $string = str_replace("<ul><br />", '<ul>', $string);
    $string = str_replace("<br /></ul>", '</ul>', $string);

    // Around <ol></ol> tags
    $string = str_replace("<br /><br /><ol>", '<br /><ol>', $string);
    $string = str_replace("</ol><br /><br />", '</ol><br />', $string);
    // Inside <ol> </ol> tags
    $string = str_replace("<ol><br />", '<ol>', $string);
    $string = str_replace("<br /></ol>", '</ol>', $string);

    // Around <li></li> tags
    $string = str_replace("<br /><li>", '<li>', $string);
    $string = str_replace("</li><br />", '</li>', $string);

    return $string;
}

This must be applied to the content before it is HTML-Purified. Never re-process purified content unless you know what you're doing.

Please note that because single and double line-breaks are already preserved as <br /> tags, you should not use the AutoFormat.AutoParagraph feature of HTML Purifier:

// Process line-breaks
$string = nl2br_special($string);

// Initiate HTML Purifier config
$purifier_config = HTMLPurifier_Config::createDefault();
$purifier_config->set('HTML.Allowed', 'p,ul,ol,li,strong,b,em,i,u,a[href],code,pre,blockquote,cite,img[src|alt],br,hr,h3,h4');
//$purifier_config->set('AutoFormat.AutoParagraph', true); // Make sure to NOT use this

// Initiate HTML Purifier
$purifier = new HTMLPurifier($purifier_config);

// Purify the content!
$string = $purifier->purify($string);

That's it!


Furthermore, because allowing basic HTML tags was originally intended to improve user experience by not adding another markup syntax, you might want to allow users to post code, and especially HTML code, which would not be interpreted/removed by HTML Purifier.

HTML Purifier currently allows posting code, but requires complex CDATA markers:

<![CDATA[
Place code here
]]>

Hard to remember and to write. To simplify the user experience as much as possible I believe it is best to allow users to add code by embedding it with simple <code> (for inline code) and <pre> (for blocks of code) tags. Here is how to do that:

function custom_code_tag_callback($code) {

    return '<code>'.trim(htmlspecialchars($code[1])).'</code>';
}
function custom_pre_tag_callback($code) {

    return '<pre><code>'.trim(htmlspecialchars($code[1])).'</code></pre>';
}

// Don't require HTMLPurifier's CDATA enclosing, instead allow simple <code> or <pre> tags
$string = preg_replace_callback("/\<code\>(.*?)\<\/code\>/is", 'custom_code_tag_callback', $string);
$string = preg_replace_callback("/\<pre\>(.*?)\<\/pre\>/is", 'custom_pre_tag_callback', $string);

Note that, like the nl2br processing, this must be done before the content is HTML Purified. Also, keep in mind that if the user puts <code> or <pre> tags in their own posted code, the first closing tag will close the parent <code> or <pre> tag enclosing the code. This cannot be solved, and it also applies to the original CDATA markers or to any markup, even the one used on Stack Overflow (for example, using the ` symbol in a code sample will close the code tag).
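
To illustrate what the callbacks do, here is a minimal, self-contained sketch that reuses custom_code_tag_callback from above. The input string is made up for the example.

```php
<?php
function custom_code_tag_callback($code) {
    // $code[1] is the text captured between <code> and </code>
    return '<code>'.trim(htmlspecialchars($code[1])).'</code>';
}

// Made-up sample input: HTML typed inside a <code> tag.
$input = 'Use <code><b>bold</b></code> for emphasis.';

$output = preg_replace_callback("/\<code\>(.*?)\<\/code\>/is", 'custom_code_tag_callback', $input);

echo $output; // Use <code>&lt;b&gt;bold&lt;/b&gt;</code> for emphasis.
```

The <b> tags are escaped into entities, so HTML Purifier sees plain text inside the <code> element and leaves it alone.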

Finally, for a great user experience there are other things we might want to automate, such as making links clickable. Luckily, this can be done with HTML Purifier's AutoFormat.Linkify feature.

Here is the final code that includes everything for an ultimate setup:

// === Declare functions ===

function nl2br_special($string){

    // Step 1: Add <br /> tags for each line-break
    $string = nl2br($string); 

    // Step 2: Remove the actual line-breaks
    $string = str_replace("\n", "", $string);
    $string = str_replace("\r", "", $string);

    // Step 3: Restore the line-breaks that are inside <pre></pre> tags
    // (Step 2 removed every newline, so "." can span a whole block without the "s" modifier)
    if(preg_match_all('/\<pre\>(.*?)\<\/pre\>/', $string, $match)){
        // $match[1] holds the contents captured between <pre> and </pre>
        foreach($match[1] as $b){
            $string = str_replace('<pre>'.$b.'</pre>', "<pre>".str_replace("<br />", PHP_EOL, $b)."</pre>", $string);
        }
    }

    // Step 4: Removes extra <br /> tags

    // Before <pre> tags
    $string = str_replace("<br /><br /><br /><pre>", '<br /><br /><pre>', $string);
    // After </pre> tags
    $string = str_replace("</pre><br /><br />", '</pre><br />', $string);

    // Around <ul></ul> tags
    $string = str_replace("<br /><br /><ul>", '<br /><ul>', $string);
    $string = str_replace("</ul><br /><br />", '</ul><br />', $string);
    // Inside <ul> </ul> tags
    $string = str_replace("<ul><br />", '<ul>', $string);
    $string = str_replace("<br /></ul>", '</ul>', $string);

    // Around <ol></ol> tags
    $string = str_replace("<br /><br /><ol>", '<br /><ol>', $string);
    $string = str_replace("</ol><br /><br />", '</ol><br />', $string);
    // Inside <ol> </ol> tags
    $string = str_replace("<ol><br />", '<ol>', $string);
    $string = str_replace("<br /></ol>", '</ol>', $string);

    // Around <li></li> tags
    $string = str_replace("<br /><li>", '<li>', $string);
    $string = str_replace("</li><br />", '</li>', $string);

    return $string;
}


function custom_code_tag_callback($code) {

    return '<code>'.trim(htmlspecialchars($code[1])).'</code>';
}

function custom_pre_tag_callback($code) {

    return '<pre><code>'.trim(htmlspecialchars($code[1])).'</code></pre>';
}



// === Process user's input ===

// Process line-breaks
$string = nl2br_special($string);

// Allow simple <code> or <pre> tags for posting code
$string = preg_replace_callback("/\<code\>(.*?)\<\/code\>/is", 'custom_code_tag_callback', $string);
$string = preg_replace_callback("/\<pre\>(.*?)\<\/pre\>/is", 'custom_pre_tag_callback', $string);


// Initiate HTML Purifier config
$purifier_config = HTMLPurifier_Config::createDefault();
$purifier_config->set('HTML.Allowed', 'p,ul,ol,li,strong,b,em,i,u,a[href],code,pre,blockquote,cite,img[src|alt],br,hr,h3,h4');
$purifier_config->set('AutoFormat.Linkify', true); // Make links clickable
//$purifier_config->set('HTML.TargetBlank', true); // Uncomment if you want links to open new tabs
//$purifier_config->set('AutoFormat.AutoParagraph', true); // Leave this commented as it conflicts with nl2br


// Initiate HTML Purifier
$purifier = new HTMLPurifier($purifier_config);

// Purify the content!
$string = $purifier->purify($string);

Cheers!

Community

Maybe this will help.

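function custom_nl2br($html) {
    // Swap every <ul>...</ul> block for a numbered placeholder so nl2br() leaves it alone
    $blocks = array();
    $html = preg_replace_callback("/<ul>.*?<\/ul>/s", function ($m) use (&$blocks) {
        $blocks[] = $m[0];
        return '[placeholder' . (count($blocks) - 1) . ']';
    }, $html);

    $html = nl2br($html);

    // Put the untouched <ul> blocks back
    foreach ($blocks as $i => $block) {
        $html = str_replace('[placeholder' . $i . ']', $block, $html);
    }

    return $html;
}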
NemanjaLazic