17

Here is the line of code I have, which works well:

$content = htmlspecialchars($_POST['content'], ENT_QUOTES);

But I would like to let certain HTML tags pass through without being converted. Here is the list of tags I would like to allow:

<pre> </pre>
<b> </b>
<em> </em>
<u> </u>
<ul> </ul>
<li> </li>
<ol> </ol>

I would also like to be able to add more tags later as I think of them. Could someone help me modify the code above so that the tags listed above pass through without being converted?

Garry
  • Htmlspecialchars doesn't look at html, it looks at characters `<`, `>`, etc and escapes them. So you cannot do it with htmlspecialchars... maybe [htmlpurifier](http://htmlpurifier.org/)? – Esailija Oct 10 '12 at 12:54
  • You cannot. But you could convert constrained whitelisted tags back afterwards, `&lt;em&gt;` to `<em>` for example. – mario Oct 10 '12 at 12:55

6 Answers

14

I suppose you could do it after the fact:

// $str is the result of htmlspecialchars()
$str = preg_replace('#&lt;(/?(?:pre|b|em|u|ul|li|ol))&gt;#', '<\1>', $str);

It converts the encoded versions of <xx> and </xx> back into real tags, where xx is in a controlled set of allowed tags. Tags that carry attributes do not match and stay escaped.
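Since the question asks for a list that can grow over time, the same idea can be sketched with the tag names kept in an array. The helper name `escape_except` and its parameters are made up for illustration; this is a minimal sketch, not a vetted sanitizer.

```php
<?php
// Hypothetical helper: escape everything, then restore only whitelisted,
// attribute-free tags. The name and signature are assumptions.
function escape_except(string $input, array $allowedTags): string
{
    $escaped = htmlspecialchars($input, ENT_QUOTES);

    if (count($allowedTags) === 0) {
        return $escaped; // nothing to restore
    }

    // preg_quote() guards against regex metacharacters in a tag name.
    $names = implode('|', array_map(
        function ($tag) { return preg_quote($tag, '#'); },
        $allowedTags
    ));

    // Matches only bare tags such as &lt;b&gt; or &lt;/b&gt;.
    return preg_replace('#&lt;(/?(?:' . $names . '))&gt;#', '<\1>', $escaped);
}

$allowed = ['pre', 'b', 'em', 'u', 'ul', 'li', 'ol'];
echo escape_except('<b>bold</b> <script>alert(1)</script>', $allowed);
// prints: <b>bold</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

Adding a tag later then only means appending its name to `$allowed`.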

Ja͢ck
6

Or you can do it the old-style way:

$content = htmlspecialchars($_POST['content'], ENT_QUOTES);

$turned = array( '&lt;pre&gt;', '&lt;/pre&gt;', '&lt;b&gt;', '&lt;/b&gt;', '&lt;em&gt;', '&lt;/em&gt;', '&lt;u&gt;', '&lt;/u&gt;', '&lt;ul&gt;', '&lt;/ul&gt;', '&lt;li&gt;', '&lt;/li&gt;', '&lt;ol&gt;', '&lt;/ol&gt;' );
$turn_back = array( '<pre>', '</pre>', '<b>', '</b>', '<em>', '</em>', '<u>', '</u>', '<ul>', '</ul>', '<li>', '</li>', '<ol>', '</ol>' );

$content = str_replace( $turned, $turn_back, $content );
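To avoid keeping two long arrays in sync by hand, the search and replacement lists could be generated from a single list of tag names. The helper `build_tag_map` below is a hypothetical name used only for this sketch.

```php
<?php
// Hypothetical helper: build the escaped => raw pairs from one list of tag names.
function build_tag_map(array $tags): array
{
    $map = [];
    foreach ($tags as $tag) {
        $map["&lt;{$tag}&gt;"]  = "<{$tag}>";   // opening tag
        $map["&lt;/{$tag}&gt;"] = "</{$tag}>";  // closing tag
    }
    return $map;
}

$map = build_tag_map(['pre', 'b', 'em', 'u', 'ul', 'li', 'ol']);

$content = htmlspecialchars('<em>hi</em> <i>no</i>', ENT_QUOTES);
$content = str_replace(array_keys($map), array_values($map), $content);

echo $content; // prints: <em>hi</em> &lt;i&gt;no&lt;/i&gt;
```

Note that `str_replace()` is case-sensitive, so an uppercase `<EM>` would stay escaped with this approach.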
Peon
2

I improved on the way Jack attacks this issue. I added support for <br>, <br/> and anchor tags. The code first converts href=&quot;...&quot; back to href="...", so that href is the only attribute allowed.

$str = preg_replace(
    array('#href=&quot;(.*)&quot;#', '#&lt;(/?(?:pre|a|b|br|em|u|ul|li|ol)(\shref=".*")?/?)&gt;#' ), 
    array( 'href="\1"', '<\1>' ), 
    $str
);
Elwin
1

I made this function to sanitize all HTML special characters except for the HTML tags specified.

It first uses htmlspecialchars() to make the string safe, then it reverts the tags I want to be untouched.

The function can optionally filter out attributes (enabled by default); be careful about disabling this if you care about possible XSS attacks.

I know regex is not efficient, but for moderate string lengths it should be fine. You can check the regex I used here: https://regex101.com/r/U6GQse/8

function sanitizeHtml($string, $safeHtmlTags = array('b','i','u','br'), $filterAttributes = true)
{
    // Escape everything first so that the string is safe by default.
    $string = htmlspecialchars($string);

    if ($filterAttributes) {
        // Drop group 3 (the attributes) when restoring a tag.
        $replace = "<$1$2$4>";
    } else {
        // Keep the attributes as they appear in the escaped string.
        $replace = "<$1$2$3$4>";
    }
    $string = preg_replace("/&lt;\s*(\/?\s*)(".implode("|", $safeHtmlTags).")(\s?|\s+[\s\S]*?)(\/)?\s*&gt;/", $replace, $string);

    return $string;
}

// Example usage to answer the OP question
$str = "MY HTML CONTENT";
echo sanitizeHtml($str, array('pre','b','em','u','ul','li','ol'));
Andrea Mauro
0

I liked Elwin's solution, but you probably want to:

  1. Prevent javascript: URLs in the href, or more likely: allow only http(s).
  2. Make the regex quantifiers non-greedy in case there are multiple <a href> tags in the content.

Here is the updated version:

$str = preg_replace(
    array('#href=&quot;(https?://.*?)&quot;#', '#&lt;(/?(?:pre|a|b|br|em|u|ul|li|ol)(\shref=".*?")?/?)&gt;#' ), 
    array( 'href="\1"', '<\1>' ), 
    $str
);
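To see how these two patterns treat an http(s) link versus a javascript: link, they can be wrapped in a small function and run on sample input. The wrapper name `restore_tags_and_links` and the sample URL are assumptions for this sketch; it illustrates the behaviour and is not a claim that the patterns are safe for every input.

```php
<?php
// Hypothetical wrapper around the two patterns above, for experimentation only.
function restore_tags_and_links(string $input): string
{
    $escaped = htmlspecialchars($input, ENT_QUOTES);

    return preg_replace(
        array(
            // Step 1: un-escape the quotes of href values that start with http(s)://
            '#href=&quot;(https?://.*?)&quot;#',
            // Step 2: restore whitelisted tags, optionally with a literal href="..."
            '#&lt;(/?(?:pre|a|b|br|em|u|ul|li|ol)(\shref=".*?")?/?)&gt;#',
        ),
        array('href="\1"', '<\1>'),
        $escaped
    );
}

$input = '<a href="https://example.com">ok</a> <a href="javascript:alert(1)">bad</a>';
echo restore_tags_and_links($input);
// prints: <a href="https://example.com">ok</a> &lt;a href=&quot;javascript:alert(1)&quot;&gt;bad</a>
```

The javascript: link stays escaped because its quotes were never converted back, although its closing </a> is still restored, which leaves an unbalanced closing tag in the output.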
Pingolin
BenJ
-3

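You could use strip_tags. Note that it removes the disallowed tags instead of escaping them, and that it leaves attributes on the allowed tags untouched. The second argument lists the allowed opening tags only, with no separators:

$exceptionString = '<pre><b><em><u><ul><li><ol>';

$content = strip_tags($_POST['content'], $exceptionString);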
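For comparison with the htmlspecialchars()-based answers, here is a small sketch of what strip_tags() does with a disallowed tag and with an attribute on an allowed tag. The sample input is made up for illustration.

```php
<?php
// strip_tags() deletes disallowed tags outright (their text content stays),
// and it does not touch attributes on the tags it allows.
$input  = '<b onmouseover="alert(1)">hi</b><i>there</i>';
$result = strip_tags($input, '<b>');

echo $result; // prints: <b onmouseover="alert(1)">hi</b>there
```

The surviving onmouseover attribute shows that strip_tags() alone does not give the escaping behaviour the question asks for.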
jnoel10