parse HTML using DOMDocument in PHP

Question

I want to use DOMDocument to parse a string that comes from a rich-text editor. Exactly what I need is:

1) Allow only the (div, p, span, b, ul, ol, li, blockquote, br) tags, and remove all other tags along with their content

Edit: I'm using strip_tags() for this

2) Allow only these styles:

style="font-weight:bold"
style="font-style: italic"
style="text-decoration: underline"

3) Remove any attributes on the allowed tags, such as class, id, etc., except for the align attribute

Any ideas?

Re your second point: What happens if an element has both `bold` and `italic` styles? And what if it's a block-level element, because changing it to a `<b>` or an `<i>` tag would change how it works. Finally, I would point out that the `<u>` tag is deprecated; it is recommended to use the `text-decoration:underline` style instead. — Spudley, Jul 30 '11 at 20:18
@Spudley Nice, I edited the question to allow only these styles — Zamblek, Jul 30 '11 at 20:23
See question http://stackoverflow.com/questions/4979836/domdocument-in-php/22345742#22345742 — lokeshsk, Mar 12 '14 at 08:29

score 1 · Accepted Answer · edited Oct 06 '11 at 20:30

I would recommend against trying to filter HTML input using DOMDocument for security reasons, in particular, due to the risk of cross-site scripting. You can easily take care of your requirements in 1 and 3 with a filter library like HTML Purifier. For the reasons Spudley mentions, number 2 is a little more difficult. I'd start by whitelisting those style attributes in HTML Purifier and then using some logic to scan for them after filtering, adding the appropriate tags inside that element.

Here's an example of using HTML Purifier the way you want (taken from basic.php). The only things I've changed are the HTML.AllowedAttributes and HTML.AllowedElements settings.

<?php
// replace this with the path to the HTML Purifier library
require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// configuration goes here:
$config->set('Core.Encoding', 'UTF-8'); // replace with your encoding
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional'); // replace with your doctype
$config->set('HTML.AllowedAttributes', '*.style, align');
$config->set('HTML.AllowedElements', 'div, p, span, b, ul, ol, li, blockquote, br');
$config->set('CSS.AllowedProperties', 'font-weight, font-style, text-decoration');


$purifier = new HTMLPurifier($config);

$html = '<div align="center" style="font-style:italic; color: red" title="removeme">Allowed</div> <img src="not_allowed.jpg" /> <script>not allowed</script>';

$filteredHtml = $purifier->purify($html);
echo '<pre>' . htmlspecialchars($filteredHtml) . '</pre>';

Which outputs:

<div align="center" style="font-style:italic;">Allowed</div>
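
As far as I know, CSS.AllowedProperties restricts which properties may appear but not which values they take, so something like `font-weight: 900` would probably still pass. If only the three exact declarations from the question should survive, a small post-filter could be applied to each remaining style attribute. The helper below is a minimal sketch under that assumption; `filterStyleDeclarations` and `$allowedStyles` are hypothetical names and are not part of HTML Purifier.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): keep only the
// whitelisted "property: value" pairs of a style attribute value.
function filterStyleDeclarations(string $style, array $allowed): string
{
    $kept = [];
    foreach (explode(';', $style) as $declaration) {
        // Split "property: value" at the first colon only.
        $parts = explode(':', $declaration, 2);
        if (count($parts) !== 2) {
            continue; // empty or malformed declaration
        }
        $property = strtolower(trim($parts[0]));
        $value    = strtolower(trim($parts[1]));
        // Keep the declaration only if both property and value match.
        if (isset($allowed[$property]) && $allowed[$property] === $value) {
            $kept[] = $property . ':' . $value;
        }
    }
    return implode(';', $kept);
}

// The three declarations the question wants to allow.
$allowedStyles = [
    'font-weight'     => 'bold',
    'font-style'      => 'italic',
    'text-decoration' => 'underline',
];

echo filterStyleDeclarations('font-style:italic; font-weight: 900; color: red', $allowedStyles), "\n";
// → font-style:italic
echo filterStyleDeclarations('FONT-WEIGHT: Bold; text-decoration: underline;', $allowedStyles), "\n";
// → font-weight:bold;text-decoration:underline
```

The comparison is an exact match on the normalized value, so equivalent spellings such as `font-weight: 700` are dropped as well; whether that is desirable depends on what the editor emits.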

HTML Purifier comes with some basic sample code (docs/example/basic.php) to get you started, and there's plenty of [documentation](http://htmlpurifier.org/docs) online. — Chris Hepner, Jul 30 '11 at 20:57
Please post sample code using HTML Purifier so I can make sure it will do what I want — Zamblek, Jul 30 '11 at 21:03
@D3VELOPER: Added an example using your tags. You should be able to figure it out from there. If you want to change any other configurations, look at [this](http://htmlpurifier.org/live/configdoc/plain.html). — Chris Hepner, Jul 30 '11 at 21:26
But this allows all style rules; I want to allow just 3 style rules — Zamblek, Jul 30 '11 at 23:47
Added to the code snippet above - please look at the documentation I linked. It's in the CSS section. — Chris Hepner, Jul 31 '11 at 00:02
@D3VELOPER let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/1988/discussion-between-chris-and-d3veloper) — Chris Hepner, Jul 31 '11 at 00:03

score 0 · Answer 2 · answered Jul 30 '11 at 20:15

Since you only want to allow a small number of HTML elements, you could consider cleaning up the HTML code with the PHP strip_tags() function prior to giving it to the DOMDocument classes.

This will certainly be easier than parsing the DOM yourself to find elements that need to be stripped.

That should deal with part 1 of your question.

It won't deal with parts 2 or 3, but it's a good start.
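
As a rough illustration of that first step, the sketch below passes the question's tag list to strip_tags(). The sample input is made up. Note that strip_tags() removes the disallowed tags themselves but keeps the text inside them and leaves attributes on allowed tags untouched, so on its own it does not satisfy "remove with content".

```php
<?php
// Tags permitted by the question, in the format strip_tags() expects.
$allowedTags = '<div><p><span><b><ul><ol><li><blockquote><br>';

// Made-up sample input for illustration.
$input = '<p class="x">Hello <i>world</i></p><script>alert(1)</script>';

$clean = strip_tags($input, $allowedTags);
echo $clean, "\n";
// → <p class="x">Hello world</p>alert(1)
```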

answered Jul 30 '11 at 20:15

Spudley


I already did that :) I hope to find an answer for parts 2 & 3 – Zamblek Jul 30 '11 at 20:19
@D3VELOPER - see my comment on the question for the flaws I see in your plans for part 2. – Spudley Jul 30 '11 at 20:20

Cem Kalyoncu · Answer 3 · 2011-07-30T20:45:06.047

I have code that does exactly this, but it is rather undocumented and uses some code I do not own but which is in the public domain. It is pretty easy to use: the fix_html function ensures all tags are closed so they do not affect your code, strip_tags_attributes limits which tags and attributes are allowed, and strip_javascript removes JavaScript functionality of any sort. I have used this extensively, but to be honest I do not know if this version is the one from production. For your second point, I guess it is best to remove styles altogether so users can use <i> or <b> as they like. And please don't let anyone use underline.

function findNodeValue($parent, $node) {
    $nodes=array();
    if(!is_a($parent, "DOMElement")) return NULL;

    foreach($parent->childNodes as $child)
        if($child->nodeName==$node) $nodes[]=$child;

    if(count($nodes)==0) return NULL;
    if(count($nodes)==1) return $nodes[0]->nodeValue;
    else {
        $ret=array();
        foreach($nodes as $node)
            $ret[]=$node->nodeValue;

        return $ret;
    }
}

function strip_javascript($filter){ 

    // realign javascript href to onclick 
    $filter = preg_replace("/href=(['\"]).*?javascript:(.*)?\\1/i", "onclick=' $2 '", $filter);

    //remove javascript from tags 
    while( preg_match("/<(.*)?javascript.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", $filter)) 
        $filter = preg_replace("/<(.*)?javascript.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", "<$1$3$4$5>", $filter); 

    // dump expressions from contibuted content 
    $filter = preg_replace("/:expression\(.*?((?>[^(.*?)]+)|(?R)).*?\)\)/i", "", $filter); 
    $filter = preg_replace("/<iframe.*?>/", "", $filter);
    $filter = preg_replace("/<\/iframe>/", "", $filter);

    while( preg_match("/<(.*)?:expr.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", $filter)) 
        $filter = preg_replace("/<(.*)?:expr.*?\(.*?((?>[^()]+)|(?R)).*?\)?\)(.*)?>/i", "<$1$3$4$5>", $filter); 

    // remove all on* events    
    while( preg_match("/<(.*)?\s?on[^>\s]+?=\s?.+?(['\"]).*?\\2\s?(.*)?>/i", $filter, $match) ) {
        $filter = preg_replace("/<(.*)?\s?on[^>\s]+?=\s?.+?(['\"]).*?\\2\s?(.*)?>/i", "<$1$3>", $filter); 
    }

    return $filter; 
}

function html2a ( $html ) {
  ini_set('pcre.backtrack_limit', 10000);
  ini_set('pcre.recursion_limit', 10000);

  if ( !preg_match_all( '@\<\s*?(\w+)((?:\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?)\>((?:(?>[^\<]*)|(?R))*)\<\/\s*?\\1(?:\b[^\>]*)?\>|\<\s*(\w+)(\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?\/?\>@uxis', $html = trim($html), $m, PREG_OFFSET_CAPTURE | PREG_SET_ORDER) )
    return $html;
  $i = 0;
  $ret = array();
  foreach ($m as $set) {
    if ( strlen( $val = trim( substr($html, $i, $set[0][1] - $i) ) ) )
      $ret[] = $val;
    $val = $set[1][1] < 0 
      ? array( 'tag' => strtolower($set[4][0]) )
      : array( 'tag' => strtolower($set[1][0]), 'val' => html2a($set[3][0]) );
    if ( preg_match_all( '/(\w+)\s*(?:=\s*(?:"([^"]*)"|\'([^\']*)\'|(\w+)))?/usix', isset($set[5]) && $set[2][1] < 0 ? $set[5][0] : $set[2][0],$attrs, PREG_SET_ORDER ) ) {
      foreach ($attrs as $a) {
        $val['attr'][$a[1]]=$a[count($a)-1];
      }
    }
    $ret[] = $val;
    $i = $set[0][1]+strlen( $set[0][0] );
  }
  $l = strlen($html);
  if ( $i < $l )
    if ( strlen( $val = trim( substr( $html, $i, $l - $i ) ) ) )
      $ret[] = $val;
  return $ret;
}

function a2html ( $a, $in = "" ) {
  if ( is_array($a) ) {
    $s = "";
    foreach ($a as $t)
      if ( is_array($t) ) {
        $attrs=""; 
        if ( isset($t['attr']) )
          foreach( $t['attr'] as $k => $v )
            $attrs.=" {$k}=".( strpos( $v, '"' )!==false ? "'$v'" : "\"$v\"" ); // {$k}: the ${k} form is deprecated since PHP 8.2
        $s.= $in."<".$t['tag'].$attrs.( isset( $t['val'] ) ? ">\n".a2html( $t['val'], $in).$in."</".$t['tag'] : "/" ).">";
      } else
        $s.= $in.$t."";
  } else {
    $s = empty($a) ? "" : $in.$a."";
  }
  return $s;
}

function remove_unclosed(&$a, $allowunclosed) {
    if(!is_array($a)) return;

    foreach($a as $k=>$tag) {
        if(is_array($tag)) {
            if(!isset($tag["val"]) && !in_array($tag["tag"],$allowunclosed)) {
                //var_dump($tag["tag"]);
                unset($a[$k]);
            } elseif(is_array(@$tag["val"]))
                remove_unclosed($a[$k]["val"], $allowunclosed);
        }
    }
}

function fix_html($html, $allowunclosed=array("br")) {
    $a = html2a($html);
    remove_unclosed($a, $allowunclosed);
    return a2html($a);
}

function strip_tags_ex($str,$allowtags) { 
    $strs=explode('<',$str); 
    $res=$strs[0]; 
    for($i=1;$i<count($strs);$i++) 
    { 
        if(!strpos($strs[$i],'>')) 
            $res = $res.'&lt;'.$strs[$i]; 
        else 
            $res = $res.'<'.$strs[$i]; 
    } 
    return strip_tags($res,$allowtags);    
}

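// NOTE: allowedtags and allowedattributes are constants that must be defined
// elsewhere (e.g. with define()); they are not part of this snippet.
function strip_tags_attributes($string,$allowtags=allowedtags,$allowattributes=allowedattributes){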
    $string=strip_javascript($string);

    $string = strip_tags_ex($string,$allowtags); 

    if (!is_null($allowattributes)) { 
        if(!is_array($allowattributes)) 
            $allowattributes = explode(",",$allowattributes); 
        if(is_array($allowattributes)) 
            $allowattributes = implode(")(?<!",$allowattributes); 
        if (strlen($allowattributes) > 0) 
            $allowattributes = "(?<!".$allowattributes.")"; 
        // create_function() was removed in PHP 8.0; use a closure instead.
        $pattern = '/ [^ =]*'.$allowattributes.'=("[^"]*"|\'[^\']*\')/i';
        $string = preg_replace_callback("/<[^>]*>/i", function ($matches) use ($pattern) {
            return preg_replace($pattern, "", $matches[0]);
        }, $string);
    } 
    return $string; 
}

I found the source for strip_javascript: http://www.php.net/manual/en/function.strip-tags.php#89453 I don't know why the attribution is not already in the code, probably because the author left no name, email, or other identity to refer to.

nobody · Answer 4 · 2011-07-30T22:12:09.050

$allowedTags = array( 'div' => true, 'p' => true, 'span' => true, 'b' => true,
    'ul' => true, 'ol' => true, 'li' => true, 'blockquote' => true, 'em' => true, 'br' => true );

$allowedStyles = array( 'font-weight: bold' => true, 'font-style: italic' => true, 'text-decoration: underline' => true );

$allowedAttribs = array( 'align' => true );

$doc = new DOMDocument();
$doc->loadXML( '<doc><p style="font-weight: bold">test</p> <b align="left">asdfasd faksd</b><script>asdfasd</script></doc>' );

sanitizeNodeChildren( $doc->documentElement );

echo htmlentities( $doc->saveXml() );

function sanitizeNodeChildren( $parentNode ) {
    $node = $parentNode->firstChild;
    while( $node ) {
        if( !sanitizeNode( $node ) ) {
            $nodeToDelete = $node;
            $node = $node->nextSibling;
            $parentNode->removeChild( $nodeToDelete );
        } else {
            sanitizeNodeChildren( $node );
            $node = $node->nextSibling;
        }
    }
}

function sanitizeNode( $node ) {
    global $allowedTags, $allowedStyles, $allowedAttribs;
    if( $node->nodeType == XML_ELEMENT_NODE ) {
        if( !isset( $allowedTags[ $node->tagName ] ) ) return false;

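        // Copy the attribute list first: removing attributes while iterating
        // the live DOMNamedNodeMap can skip entries.
        foreach( iterator_to_array( $node->attributes ) as $name => $attrNode ) {
            if( $name == 'style' ) {
                if( isset( $allowedStyles[ $attrNode->nodeValue ] ) ) continue;
            }
            if( isset( $allowedAttribs[ $name ] ) ) continue;
            $node->removeAttribute( $name );
        }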
    }

    return true;
}
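
The snippet above uses loadXML(), which expects well-formed XML; output from a rich-text editor (for example an unclosed `<br>`) often is not. One possible alternative is loadHTML(). The sketch below wraps the fragment in a `<div>` so it has a single root and returns that wrapper, whose children could then be passed to sanitizeNodeChildren(). The names `loadHtmlFragment` and `innerHtml` are hypothetical helpers, and the `<?xml encoding="UTF-8">` prefix is a commonly used workaround to make libxml treat the input as UTF-8, not an official API.

```php
<?php
// Hypothetical helper: parse an HTML fragment leniently and return the
// wrapper <div> that contains the parsed nodes.
function loadHtmlFragment(string $html): DOMElement
{
    $doc = new DOMDocument();
    // Collect libxml warnings about sloppy markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    // The first <div> in document order is the wrapper added above.
    return $doc->getElementsByTagName('div')->item(0);
}

// Hypothetical helper: serialize the children of an element back to HTML.
function innerHtml(DOMElement $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Made-up input with an unclosed <br> and <li>, which loadXML() would reject.
$root = loadHtmlFragment('<p>one<br>two</p><ul><li>item</ul>');
foreach ($root->childNodes as $child) {
    echo $child->nodeName, "\n"; // top-level nodes of the fragment
}
echo innerHtml($root), "\n";
```

After sanitizing `$root`, innerHtml($root) would give the cleaned fragment without the wrapper element.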
