
As an alternative to markdown, I'm looking for a way to safely parse a configurable subset of HTML, in JavaScript.

For example, for an (untrusted) input of

<b onclick="alert('XSS');" data-myapp="asdf" style="color:red">Hello</b>
<h1><i style="color:expression(alert('XSS'));"> World</i></h1>

with the parameters

allowTags: b, i
allowAttrs: data-myapp
allowSafeStyle: color

I'd expect the output

<b data-myapp="asdf" style="color:red">Hello</b>
<i> World</i>
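
As a rough sketch of how these parameters could be expressed in JavaScript, a plain object with one whitelist per category is enough. The property names mirror the list above; `isAllowed` is a hypothetical helper introduced only for illustration.

```javascript
// Parameters from the example above, written as a plain object.
var params = {
    allowTags: ["b", "i"],
    allowAttrs: ["data-myapp"],
    allowSafeStyle: ["color"]
};

// Hypothetical helper: case-insensitive whitelist lookup.
function isAllowed(list, name) {
    return list.indexOf(String(name).toLowerCase()) !== -1;
}
```

With this shape, `isAllowed(params.allowTags, "B")` is true while `isAllowed(params.allowAttrs, "onclick")` is false.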

Markdown doesn't seem to be able to express more complex properties. Caja seems to be pretty close to what I want, but requires server-side rendering. So, how can I render a safe subset of HTML (as defined by the parameters allowTags, allowAttrs, etc. above) in JavaScript?

phihag
  • Where do you get the "untrusted input" from in js? If it comes from the user, it will do no harm, if it comes from the server it needs to be already sanitized. – Bergi Jul 03 '12 at 11:44
  • @Bergi It comes from another user, via [WebRTC](http://www.webrtc.org/) or a ["dumb" WebSockets proxy](https://github.com/mcolyer/em-websocket-proxy). In any case, if it is possible to sanitize/transform the code on the server, it should be as well in client-side JavaScript, shouldn't it? – phihag Jul 03 '12 at 11:46
  • Client side will never be safe because JS can be turned off. You should always validate or filter server side for security. Client side validation is only for user convenience. – Pein Jul 03 '12 at 11:35
  • Huh? If JavaScript is turned off, nothing would get rendered in the first place. The input is a JavaScript string, or a preparsed DOM node object. Since my application must also work when the server is offline, I can't offload the validation/transformation to the server. – phihag Jul 03 '12 at 11:40
  • OK, I'd prefer a smart WS proxy :-) But with a P2P connection, I can understand your need for clientside sanitizing. – Bergi Jul 03 '12 at 11:52
  • @phihag: Doing this client-side is a bad idea; the "hackers" could easily open your script and look for holes they could exploit, and there will always be a few. Do this server-side with PHP, aspx etc. – OptimusCrime Jul 03 '12 at 12:08
  • @OptimusCrime What you're suggesting is known as [security by obscurity](http://en.wikipedia.org/wiki/Security_by_obscurity), and dangerous at best (not to mention that if security of the server code matters, you should publish it anyway to fulfill [Kerckhoffs's principle](http://en.wikipedia.org/wiki/Kerckhoffs%27s_principle)). And no offense, but a vague notion of `"hackers"` does not substitute for an attacker model. In my case, the attacker is another node in the network. Note that as I wrote above, my application must also work if the server is unreachable. – phihag Jul 03 '12 at 15:59
  • possible duplicate of [Sanitize/Rewrite HTML on the Client Side](http://stackoverflow.com/questions/295566/sanitize-rewrite-html-on-the-client-side) – phihag Jul 03 '12 at 17:43

1 Answer


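I use jQuery to keep the answer short and cut boilerplate, but it is not essential to the approach.

I parse the input by assigning it to .innerHTML of a detached body element, preferably one that belongs to a separate document created with document.implementation.createHTMLDocument(), because scripts and CSS in markup parsed this way are not executed. Note that in the fallback branch (an element created in the live document) resource loads such as <img src> can still start, so inline handlers like onerror may fire before the filter runs.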

Demo here http://jsfiddle.net/QCaGq/

function filterData(data, options ){
    var root;

    try {
        root = document.implementation.createHTMLDocument().body;
    }
    catch(e) {
        root = document.createElement("body");
    }

    root.innerHTML = data;

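    // Unwrap every element whose tag is not whitelisted: keep its child
    // nodes (text nodes included) in place and drop the element itself.
    $(root).find("*").filter(function(){
        return options.allowTags.indexOf(this.tagName.toLowerCase()) === -1;
    }).each( function() {
        // .contents() also moves text nodes; .children() would lose them.
        $(this).contents().insertBefore( this );
        $(this).remove();
    });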

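    // Rebuild the style attribute from whitelisted properties only.
    function removeStyle( node, attr ) {
        var style = node.style,
            prop,
            name,
            len = style.length,
            i, val = "";

        for( i = 0; i < len; ++i ) {
            name = style[i];
            // Indexed names are hyphenated (e.g. "background-color"); style[name]
            // is not reliably supported for those, so use getPropertyValue().
            prop = style.getPropertyValue( name );

            if( options.allowSafeStyle.indexOf( name ) > -1 ) {
                val += (name + ":" + prop + ";");
            }
        }

        if( val ) {
            node.setAttribute( "style", val );
        }
        else {
            node.removeAttribute("style");
        }
    }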

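    // Drop every attribute that is not whitelisted; style gets its own filter.
    function removeAttrs( node ) {
        // node.attributes is live, so iterate over a snapshot: removing entries
        // while walking the live list by index would skip the following attribute.
        $.each( $.makeArray( node.attributes ), function( index, attr ) {

            if( !attr ) {
                return;
            }

            if( attr.name.toLowerCase() === "style" ) {
                return removeStyle( node, attr );
            }

            if( options.allowAttrs.indexOf(attr.name.toLowerCase()) === -1 ) {
                node.removeAttribute(attr.name);
            }
        });
    }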

    function walk( root ) {
        removeAttrs(root);
        $( root.childNodes ).each( function() {
            if( this.nodeType === 8 ) { //Remove html comments
                $(this).remove();
            }
            else if( this.nodeType === 1 ) {
                walk(this);
            }
        });
    }

    walk(root);

    return root.innerHTML; 
}

var opts = {
    allowTags: ["b", "i"],
    allowAttrs: ["data-myapp"],
    allowSafeStyle: ["color"]
};

filterData( '<b onclick="alert(\'XSS\');" data-myapp="asdf" style="color:red">Hello</b>\n<h1><i style="color:expression(alert(\'XSS\'));"> World</i></h1>', opts );

Results in:

<b data-myapp="asdf" style="color:red;">Hello</b>
<i> World</i>

This should get you started.
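
One general pitfall when filtering attributes is that node.attributes is a live collection: removing an entry shifts the remaining ones down, so an index-based loop over the live list can skip the attribute that follows a removed one. The sketch below models this with a plain array standing in for node.attributes (an assumption made so the example runs without a DOM); both function names are hypothetical.

```javascript
// A plain array of attribute names stands in for the live node.attributes
// collection (assumption for illustration only).

// Index-based loop with a cached length over the collection being mutated.
function removeDisallowedInPlace(attrs, allowed) {
    var len = attrs.length;
    for (var i = 0; i < len; i++) {
        var name = attrs[i];
        if (name === undefined) { continue; } // ran past the shrunken list
        if (allowed.indexOf(name) === -1) {
            attrs.splice(i, 1); // later entries shift down; the next one is skipped
        }
    }
    return attrs;
}

// Same filter, but iterating over a copy taken before any removal.
function removeDisallowedFromSnapshot(attrs, allowed) {
    attrs.slice().forEach(function (name) {
        if (allowed.indexOf(name) === -1) {
            attrs.splice(attrs.indexOf(name), 1);
        }
    });
    return attrs;
}
```

For the input ["onclick", "onmouseover", "data-myapp"] with only "data-myapp" allowed, the in-place version leaves "onmouseover" behind, while the snapshot version removes both handlers.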

Esailija
  • Wouldn't this ` – Oleg V. Volkov Jul 03 '12 at 12:07
  • +1 I was looking for a well-tested library, but I guess I'll have to write it myself. Your code looks good. I'm not certain there is a contract guaranteeing that evaluating `.style` properties of an unattached document is safe, but it seems to work great in practice (including on IE7). – phihag Jul 03 '12 at 17:06
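
If relying on `.style` is a concern, as raised in the last comment, one possible alternative is to filter the raw style attribute text without asking the browser to evaluate it. The following is only a sketch under stated assumptions: `filterStyleText` is a hypothetical helper, declarations are split naively on ";", and values are restricted to a deliberately conservative character set, so legitimate values such as rgb(...) or url(...) are rejected as well.

```javascript
// Hypothetical helper, not part of the answer above: keeps only whitelisted
// properties whose values consist of a conservative set of characters.
function filterStyleText(styleText, allowedProps) {
    // No parentheses, quotes, or backslashes: this rejects expression(...),
    // url(...), and CSS escapes, at the cost of also rejecting rgb(...).
    var safeValue = /^[a-z0-9#%.\s-]+$/i;
    var kept = [];

    styleText.split(";").forEach(function (decl) {
        var colon = decl.indexOf(":");
        if (colon === -1) { return; } // not a "name: value" pair

        var name = decl.slice(0, colon).trim().toLowerCase();
        var value = decl.slice(colon + 1).trim();

        if (allowedProps.indexOf(name) !== -1 && safeValue.test(value)) {
            kept.push(name + ":" + value);
        }
    });

    return kept.join(";");
}
```

Called with the two style values from the question and ["color"] as the whitelist, this keeps "color:red" and drops the expression(...) declaration entirely.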