1

I need to save some user-submitted data that contains HTML tags, so I cannot run strip_tags on the whole text, and I cannot use htmlentities because the tags must still format the text. To protect other users against XSS, I must remove any JavaScript from inside the tags.

What is the best way to do this?

starbeamrainbowlabs
BASILIO
  • 1
    http://stackoverflow.com/questions/1886740/php-remove-javascript – Michal Apr 09 '13 at 16:07
  • If you're looking to filter using JavaScript, a similar question has been asked at http://stackoverflow.com/questions/295566/sanitize-rewrite-html-on-the-client-side. – KernelPanik Apr 09 '13 at 16:10

4 Answers

3

If you need to save HTML tags in your database and later print them back to the browser, there is no 100% secure way to do this using only built-in PHP functions. It's easy when there are no HTML tags: for plain text, the built-in PHP functions are enough to clean it.

There are some functions that strip XSS from text, but they are not 100% secure, and there is always a way for XSS to slip through. Your regex example is fine, but what if I use, let's say, < script>alert('xss')</script>, or some other combination that the regex would miss and the browser would execute?
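To make the point concrete, here is a minimal sketch of two inputs that get past a single-pass regex filter. The pattern and the function name are made up for illustration; they are not the regex from the question.

```php
<?php
// Hypothetical naive filter: strip <script>...</script> blocks in one regex pass.
function stripScriptsNaively(string $html): string
{
    return preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
}

// Nested tag: removing the inner pair reassembles a working outer tag.
$nested = '<scr<script></script>ipt>alert(1)</script>';
echo stripScriptsNaively($nested), "\n";   // prints <script>alert(1)</script>

// Event handler attributes never involve a <script> tag at all.
$handler = '<img src="x" onerror="alert(1)">';
echo stripScriptsNaively($handler), "\n";  // printed unchanged
```

Patching the pattern for these two cases only moves the problem to the next variation, which is why a real parser plus a whitelist is the safer route.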

The best way to do this is to use something like HTML Purifier.
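A minimal sketch of what that could look like, assuming HTML Purifier is installed (for example via Composer as ezyang/htmlpurifier) and its autoloader is already loaded. The wrapper name purifyUserHtml and the default list of allowed tags are my own choices for illustration, not something the library prescribes.

```php
<?php
// Assumption: HTML Purifier is installed and its autoloader has been loaded
// before this function is called.

/**
 * Hypothetical wrapper: keeps only the tags and attributes listed in $allowed.
 * $allowed uses HTML Purifier's "HTML.Allowed" syntax: tag[attr1|attr2],...
 */
function purifyUserHtml(string $dirty, string $allowed = 'p,b,i,u,a[href],img[src|alt]'): string
{
    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML.Allowed', $allowed);
    // Restrict link and image URLs to web schemes, so javascript: URLs are dropped.
    $config->set('URI.AllowedSchemes', array('http' => true, 'https' => true));

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirty);
}

// Usage sketch, e.g. before saving a submitted form field:
// $saferText = purifyUserHtml($_POST['textarea']);
```

The allowed-tags string is a comma-separated list of tag names, with the permitted attributes for each tag in square brackets.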

Also note that there are two levels of security: the first is when data goes into your database, and the second is when it comes back out of your database.
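For the second level, here is a sketch of output escaping for fields that are meant to be plain text. It does not apply to the purified HTML itself, which has to be printed as-is, since escaping it would show the tags literally. The function name escapeForHtml is made up for illustration.

```php
<?php
// Hypothetical helper: escape a plain-text value right before printing it into HTML.
function escapeForHtml(string $text): string
{
    // ENT_QUOTES also escapes single quotes, which matters inside attribute values.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo escapeForHtml('<script>alert("xss")</script>'), "\n";
// prints &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```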

Hope this helps!

Matija
  • 3
    There _are_ 100% secure ways to do it, using HTML parsers (actual parsers, not regex-based parsers) and a tag and attribute whitelist. All Stack Exchange websites do it. – zneak Apr 09 '13 at 16:15
  • Didn't I link HTML Purifier in my answer? :) I said it's not 100% secure using PHP built-in functions or regex. – Matija Apr 09 '13 at 16:17
  • 1
    I'm mostly addressing the first paragraph of your answer. – zneak Apr 09 '13 at 16:19
  • Oh you are 100% right, now when I read it again I can see what you mean. I expressed myself in a wrong way. My bad! – Matija Apr 09 '13 at 16:20
  • Thank you, I will try HTML Purifier, but I cannot find a simple example anywhere, something like `$safer_text = function($_POST['textarea'],$allowed_tags);`. By the way, what should the allowed tags variable look like? – BASILIO Apr 09 '13 at 19:05
2

I suggest that you use DOMDocument (with loadHTML) to load said HTML, remove every kind of tag and every attribute you don't want to see, and save the HTML back (using saveXML or saveHTML). You can do that by recursively iterating over the children of the document's root and replacing the tags you don't want with their inner contents. Since loadHTML parses markup in much the same way browsers do, it's a much safer way to do it than using regular expressions.

EDIT Here's a "purifying" function I made:

<?php

function purifyNode($node, $whitelist)
{
    $children = array();
    // copy childNodes since we're going to iterate over it and modify the collection
    foreach ($node->childNodes as $child)
        $children[] = $child;

    foreach ($children as $child)
    {
        if ($child->nodeType == XML_ELEMENT_NODE)
        {
            purifyNode($child, $whitelist);
            if (!isset($whitelist[strtolower($child->nodeName)]))
            {
                while ($child->childNodes->length > 0)
                    $node->insertBefore($child->firstChild, $child);

                $node->removeChild($child);
            }
            else
            {
                $attributes = $whitelist[strtolower($child->nodeName)];
                // copy attributes since we're going to iterate over it and modify the collection
                $childAttributes = array();
                foreach ($child->attributes as $attribute)
                    $childAttributes[] = $attribute;

                foreach ($childAttributes as $attribute)
                {
                    if (!isset($attributes[$attribute->name]) || !preg_match($attributes[$attribute->name], $attribute->value))
                        $child->removeAttribute($attribute->name);
                }
            }
        }
    }
}

function purifyHTML($html, $whitelist)
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // make sure <html> doesn't have any attributes
    while ($doc->documentElement->hasAttributes())
        $doc->documentElement->removeAttributeNode($doc->documentElement->attributes->item(0));

    purifyNode($doc->documentElement, $whitelist);
    $html = $doc->saveHTML();
    $fragmentStart = strpos($html, '<html>') + 6; // 6 is the length of <html>
    return substr($html, $fragmentStart, -8); // 8 is the length of </html> + 1
}

?>

You would call purifyHTML with an unsafe HTML string and a predefined whitelist of tags and attributes. The whitelist format is 'tag' => array('attribute' => 'regex'). Tags that don't exist in the whitelist are stripped, with their contents inlined in the parent tag. Attributes that aren't listed for a given tag are removed, as are listed attributes whose value doesn't match the regex.

Here's an example:

<?php

$html = <<<HTML
<p>This is a paragraph.</p>
<p onclick="alert('xss')">This is an evil paragraph.</p>
<p><a href="javascript:evil()">Evil link</a></p>
<p><script>evil()</script></p>
<p>This is an evil image: <img src="error.png" onerror="evil()"/></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>
HTML;

// whitelist format: tag => array(attribute => regex)
$whitelist = array(
    'b' => array(),
    'i' => array(),
    'u' => array(),
    'p' => array(),
    'img' => array('src' => '@\Ahttp://.+\Z@', 'alt' => '@.*@'),
    'a' => array('href' => '@\Ahttp://.+\Z@')
);

$purified = purifyHTML($html, $whitelist);
echo $purified;

?>

The result is:

<p>This is a paragraph.</p>
<p>This is an evil paragraph.</p>
<p><a>Evil link</a></p>
<p>evil()</p>
<p>This is an evil image: <img></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>

Obviously, you don't want to allow any on* attribute, and I would advise against style because of weird proprietary properties like behavior. Make sure all URL attributes are validated with a decent regex that matches the full string (\Aregex\Z).
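To illustrate why the anchors matter, here is a small sketch (the helper name and the test URLs are made up for illustration). Without \A and \Z, the pattern only has to match somewhere inside the value:

```php
<?php
// Hypothetical helper: true if the attribute value matches the given pattern.
function isAllowedUrl(string $value, string $pattern): bool
{
    return preg_match($pattern, $value) === 1;
}

$unanchored = '@http://.+@';       // matches anywhere inside the value
$anchored   = '@\Ahttp://.+\Z@';   // must match the whole value

// A javascript: URL that merely contains an http:// URL further along.
$evil = 'javascript:evil()//http://example.org/';

var_dump(isAllowedUrl($evil, $unanchored));                        // bool(true): slips through
var_dump(isAllowedUrl($evil, $anchored));                          // bool(false): rejected
var_dump(isAllowedUrl('http://example.org/image.png', $anchored)); // bool(true)
```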

zneak
  • 1
    Will it work with fragments of HTML, or will it try to create a full document, `<html>` tags and all? – cHao Apr 09 '13 at 16:16
  • @cHao, it will try to create a full document, but you just have to iterate over what's inside `<body>`. Besides, if you use the recursive approach and don't whitelist html and body, it should work just as if it was a fragment. – zneak Apr 09 '13 at 16:22
  • @Hogan, I'll delete the answer if you can. – zneak Apr 09 '13 at 19:15
2

You have to parse the HTML if you want to allow specific tags.

There is already a nice library for that purpose: HTML Purifier (open source, under the LGPL).

ComFreek
0

I wrote this code for this purpose. You can pass it a list of tags and attributes to remove:

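// Removes the given attribute(s) from every element in the document.
// $Name is a single attribute name or an array of names.
function RemoveTagAttribute($Dom, $Name) {
    $finder = new DomXPath($Dom);
    if (!is_array($Name)) $Name = array($Name);
    foreach ($Name as $Attribute) {
        $Attribute = strtolower($Attribute);
        // query again after each pass until no element carries the attribute
        do {
            $tag = $finder->query("//*[@" . $Attribute . "]");
            foreach ($tag as $T) {
                if ($T->hasAttribute($Attribute)) {
                    $T->removeAttribute($Attribute);
                }
            }
        } while ($tag->length > 0);
    }
    return $Dom;
}

// Removes every element with the given tag name(s), including its contents.
// $Name is a single tag name or an array of names.
function RemoveTag($Dom, $Name) {
    if (!is_array($Name)) $Name = array($Name);
    foreach ($Name as $tagName) {
        $tagName = strtolower($tagName);
        // the node list is live, so removing nodes while iterating can skip
        // some of them; repeat until none are left
        do {
            $tag = $Dom->getElementsByTagName($tagName);
            foreach ($tag as $T) {
                $T->parentNode->removeChild($T);
            }
        } while ($tag->length > 0);
    }
    return $Dom;
}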

example:

  // $HTML holds the untrusted input
  $dom = new DOMDocument;
  $HTML = str_replace("&", "&amp;", $HTML);  // disguise &s going IN to loadHTML()
  // $dom->substituteEntities = true;  // collapse &s going OUT to transformToXML()
  $dom->recover = TRUE;
  @$dom->loadHTML('<?xml encoding="UTF-8">' . $HTML);
  // dirty fix
  foreach ($dom->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
      $dom->removeChild($item); // remove hack
  $dom->encoding = 'UTF-8'; // insert proper
  $dom = RemoveTag($dom, "script");
  $dom = RemoveTagAttribute($dom, array("onmousedown", "onclick"));
  echo $dom->saveHTML();
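
Listing every handler name by hand is easy to leave incomplete. As a sketch of a more general variant (not part of the code above; the function name is made up), an XPath query can select every attribute whose name starts with "on":

```php
<?php
// Hypothetical helper: removes every attribute whose name starts with "on"
// (onclick, onerror, onload, ...) from all elements in the document.
function removeEventHandlerAttributes(DOMDocument $dom): void
{
    $xpath = new DOMXPath($dom);
    // translate() lower-cases "O" and "N" so the prefix test ignores case.
    $query = "//@*[starts-with(translate(name(), 'ON', 'on'), 'on')]";
    // Copy the result into an array before modifying the document.
    $attributes = iterator_to_array($xpath->query($query), false);
    foreach ($attributes as $attribute) {
        $attribute->ownerElement->removeAttributeNode($attribute);
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p onclick="evil()" title="ok">text</p><img src="a.png" onerror="evil()">');
removeEventHandlerAttributes($dom);
echo $dom->saveHTML();
```

This is still a blacklist: it does nothing about javascript: URLs in href or src, so it complements a whitelist rather than replacing one.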
mohammad mohsenipur