4

So I am receiving html from a foreign untrusted source via json. I want to display the html in a div container like so:

$('#container').html(dangerousHTMLCode);

How can I prevent most importantly javascript injection, and secondary would be altering the styles of the page in general. This is all client side. The container div should be a flexible height to match the height of the content (possibly ruling out some iframe solution).

Update: The goal is to strip out all javascript and css from the html. This includes js and css that are present in the attributes of dom items (style="", onclick="", etc)

RobKohr
  • 6,611
  • 7
  • 48
  • 69
  • You may want to have a look at this post: http://stackoverflow.com/questions/1219860/javascript-jquery-html-encoding – Kyle Mar 15 '13 at 17:42
  • Is this applicable in your case? http://stackoverflow.com/questions/4079179/jquery-html-strips-out-script-tags – QFDev Mar 15 '13 at 17:44
  • 3
    Have fun: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet – Aaron Blenkush Mar 15 '13 at 17:46
  • Do you require the flexible height? It seems impossible without an iFrame (or whitelisting tags and attributes). – John Dvorak Mar 15 '13 at 18:13
  • Not even `innerHTML` or `$.parseHTML` are safe. Are you willing to require XHTML? – John Dvorak Mar 15 '13 at 19:08
  • 2
    how do you get this data from the foreign site? via jsonp cause of crossdomain? if yes, then every effort of finding xss attempts in the received data is useless, because the xss could already have happened before you ever get notified that there is data available. – t.niese Mar 15 '13 at 19:25
  • 1
    t.niese has a good point here: what is your data source? if it's the user, why not let him attack himself, and then treat the data properly server-side, where you can use any HTML parser without the fear of accidentally running anything. If your data source is a remote website, I recommend using a server-side proxy. – John Dvorak Mar 15 '13 at 19:30
  • This site does seem to do it right. A client-side markdown implementation, most likely? – John Dvorak Mar 15 '13 at 19:46

2 Answers2

2

OK my first attempt at this was a failure. I agree with Jan Dvorak's comment that the best approach is probably to do this with an XSS tool in a server-side proxy, especially since you're probably having to go through some kind of proxy anyway because you're doing cross-site requests and if you're using JSONP all is lost already.

However, since the question asked for a jQuery way of doing this...

Ideally, you'd find an HTML parser written in javascript, use it to build an element tree, and remove any elements or attributes that didn't match a whitelist of safe attributes.

Since I'm not aware of such a parser, and since you're in a browser that does have a parser, we'll try using that. We have to be careful though, that parser is attached to a javascript engine and an HTTP client, among other things.

First, as was pointed out in feedback to my first attempt, we have to do some work before we do anything that will create DOM elements because some events can run prior to DOm insertion. We need at the very least to make sure that no onX attributes will be parsed before any DOM objects are created. It might also be a good idea to run some interference with preloading in general. To that end, let's do some simple text transformation:

var xmlNameStartChars = "a-zA-Z_\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u02ff\\0370-\\u037d\\u037f-\\u1fff\\u200c\\u200d\\u2070-\\u2218f\\u2c00-\\u2fef\\u3001-\\udbbf\\udc00-\\udfff\\uf900-\\ufdcf\\ufdf0-\\ufffd";
var xmlNsPfx = "[" + xmlNameStartChars + "][-.0-9\\u00b7\\u0300-\\u036f\\u203f-\\u2040" + xmlNameStartChars + "]*:";
var tagStartRE = new RegExp("<\\/?(" + xmlNsPfx + ")?", "g");
var tagStartDeZRE = new RegExp("(<\\/(" + xmlNsPfx + ")?)z", "g");
dangerousHTMLCode = dangerousHTMLCode.replace(/on/gi, "z$&"); // run interference with onX
// run interference with preloading
// But don't interfere with namespaces
dangerousHTMLCode = dangerousHTMLCode.replace(/<\/?(\w*:)?/g, "$&z");

Now we have made a best effort at making this safe to build a DOM tree out of. NOTE, however, that this best effort is NOT a guarantee of safety - there may very well be attacks that I have not considered, possibly revolving around bugs in past, present or future browsers or plugins. A particular concern is that I have made the assumption that the only attributes dangerous enough to have to interfere with at this point start with "on"; I think that's the case but I'm well short of 100% confident about it.

Continue at your own risk

As Jan pointed out to me in comments to my first attempt, a whitelist approach is probably superior to a blacklist approach. I'm going to start with a pretty basic list of whitelisted elements, add/remove to taste; we'll prefix them with zs because our text manipulation did that too.

var wlElements = "zdiv, zspan, zem, zstrong, zp, za, zimg, ztable, zthead, ztbody, ztfoot, ztr, zth, ztd";
var nonWlSelector = ":not(" + wlElements + ")";
var dangerousDOM = $("<div/>").html(dangerousHTMLCode);
dangerousDOM.find(nonWlSelector).remove();

Now for the fun part, you have to remove dangerous attributes. This time I blacklist, in part because I was too lazy to think of all the attributes I'd want to whitelist... but I do whitelist URL schemes in src and href, it's not just "javascript:" that's potentially unsafe, "vbscript:" and "livescript:" at a minimum are dangerous in some browsers. You should probably whitelist attributes, there's a real chance that I forgot about or never knew about script attributes that do not start "on", for example. I haven't found a way of finding "bad" attributes without doing a brute-force DOM walk, so let's do that:

var badAttrs = /^(.*:)(zon|style|background)/i;
var suspectAttrs = /^(.*:)(src|href)$/i;
var goodSchemes = /^\s*([^:]*$|ftp:|tel:|https?:)/i;
function processAttributes(element) {
    var toRemove = [];
    var attrs = element.attributes;
    for (var i = 0; i < attrs.length; i++) {
        var name = attrs[i].name, val = attrs[i].value;
        if (badAttrs.test(name) || (suspectAttr.test(name) && !goodSchemes.test(val)) {
            toRemove.push(attrs[i].name);
        }
    }
    while (toRemove.length) {
        element.removeAttribute(toRemove.pop());
    }
}

// Start walking from the root of our DOM fragment
var root = dangerousDOM[0];
var elements = [root];

// Walk until we have no more elements, processing their attributes and adding their children
while (elements.length) {
    var elem = elements.pop();
    if (elem.hasAttributes()) {
        processAttributes(elem);
    }

    // Find children of this element and queue them up
    child = elem.firstChild;
    while (child) {
        if (child.nodeType == 1) {
            // It's an element
            elements.push(child);
        }
        child = child.nextSibling;
    }
}

And now we're ready to undo text manipulations we did at the start and inject the fragment. Again, our best effort to make this safer could have failed to consider attacks that work today, and could still allow attacks against future browser/plugin bugs. So, again, continue at your own risk.

var lessDangerousHtml = dangerousDOM.html();
lessDangerousHtml = lessDangerousHtml.replace(/z(on)/gi, "$1");
lessDangerousHtml = lessDangerousHtml.replace(tagStartDeZRE, "$1");
$("#container").html(lessDangerousHtml);

Many thanks to t.niese for constructive criticism.

MattW
  • 4,480
  • 1
  • 12
  • 11
  • 1
    this still could be dangerous. i haven't tested the following idea if it could work with some more research, but it is not an absurd idea: if you have something like this `var dangerousHTMLCode = "
    <:script"+">alert('ok');"+":script>
    ";` then `<:script>` would be mapped to `` and `z` is namespace of `html` so probably it would still be possible that it would be interpreted as `script`
    – t.niese Mar 16 '13 at 17:03
  • 1
    even if the above idea would not work, creating the native DOM and then do whitelisting is not a good idea, because you probably could have missed some `xss` possibility. – t.niese Mar 16 '13 at 17:16
  • That attack approach doesn't look like it works in HTML, but I would imagine that it would in XHTML... I can exclude it from my regexp, but yeah it's looking creaky already. Namespace prefixes also break my current treatment of attributes. – MattW Mar 16 '13 at 17:38
  • So now it makes an attempt to deal with namespace prefixes. I didn't know what to do about Unicode supplemental planes in a prefix - planes 1 through 15 _are_ allowed, but I have never dealt with those in a JS regex before - do you just match the surrogate pairs? – MattW Mar 16 '13 at 18:18
  • i don't have he time to find a working solution (if one exists). but there are also other ideas that are known that they worked: e.g. `o\uFEFFnclick=\"alert('test');\"` or `o\u00click=\"alert('test');\"` and could have bypassed your approach (probably there are similar bugs that still exists). i just want to show, that the native DOM should not be created before you rebuild the html with whitelisting. your regex is some kind of blacklisting. and this could break if you forget something or if new features are introduced in the browser, that enables scripting with other attributes. – t.niese Mar 16 '13 at 18:18
  • I was aware of those attack vectors, but I couldn't get them to work... either I just didn't try hard enough, or they work before the final html is served up to clients, e.g. because roundtripping them to a database causes the injected characters to be removed. It's certainly the case that following WhatWG's HTML parsing algorithm would cause a browser not to treat those strings as a single attribute; which of course is not to say that every browser does that correctly. – MattW Mar 16 '13 at 18:23
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/26303/discussion-between-t-niese-and-mattw) – t.niese Mar 16 '13 at 18:46
0

You could use an iframe and load the untrusted HTML inside that iframe. The iframe you have would use the sandbox attribute to prevent any JavaScript injected inside the iframe to modify the environment outside the iframe.

<iframe class="untrusted" src="http://unsafe.example.com/" sandbox />

Or if the untrusted HTML is JSON then deparse JSON on your server.

<iframe class="untrusted" src="/Unsafe?url=http://unsafe.example.com/foo.json" sandbox />
Fred
  • 12,086
  • 7
  • 60
  • 83