Sanitizing html string with javascript using browser to interpret html

Question

I want to use a whitelist of tags, attributes, and values to sanitize an HTML string before I place it in the DOM. Can I safely construct a DOM element and traverse it to implement the whitelist filter, assuming that no malicious JavaScript can execute until I append the element to the document? Are there pitfalls to this approach?

I haven't used the library in the accepted answer myself, but you might check out http://stackoverflow.com/questions/5575559/javascript-based-x-html-css-sanitization , with the help pages of perhaps most relevance: https://www.owasp.org/index.php/DOM_based_XSS_Prevention_Cheat_Sheet#Guideline and http://code.google.com/p/owasp-esapi-js/wiki/MitigatingDOMBasedXSS — Brett Zamir, Feb 13 '14 at 00:52
The advantage of this over HTMLPurifier, etc. would be that it can run dynamically on the client-side without round-tripping to the server. — Brett Zamir, Feb 13 '14 at 00:54
As for the whitelist that you need, https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet#RULE_.236_-_Sanitize_HTML_Markup_with_a_Library_Designed_for_the_Job does mention one JS library, https://github.com/ecto/bleach, and perhaps it could be adapted for client-side usage. However, it appears to rely on regular expressions, which I would not trust to do the job very well (e.g., it doesn't currently match newlines within tags). — Brett Zamir, Feb 13 '14 at 01:10
I also found: https://github.com/gbirke/Sanitize.js. I like both answers here - what is the protocol about choosing the correct answer? — Piwakawaka, Feb 13 '14 at 17:45
Haven't examined it, but its approach definitely sounds like the way to go. As far as liking both answers, do you mean liking both libraries or liking both of our Stack Overflow answers? If the latter, no worries. Normally, it's whatever you liked the best (I like to pick the first poster if the answers were similar.). Once you have enough reputation, you can also up-vote other answers. — Brett Zamir, Feb 13 '14 at 22:55

Brett Zamir · Answer 1 · 2017-06-14T03:01:42.190

It doesn't appear that anything will execute until you insert into the document, as per @rvighne's answer, but there are at least these (unusual) exceptions (tested in FF 27.0):

var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var el = document.createElement('div');
el.innerHTML = userInput;
el.addEventListener("click", function(e) {
    if (e.target.nodeName.toLowerCase() === 'a') {
        alert("I will also cause side effects; I shouldn't run on the wrong link!");
    }
});
el.getElementsByTagName('a')[0].click(); // Alerts "boo!" and "I will also cause side effects; I shouldn't run on the wrong link!"

...or...

var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var el = document.createElement('div');
el.innerHTML = userInput;
el.addEventListener("cat", function(e) { this.getElementsByTagName('a')[0].click(); });
var event = new CustomEvent("cat", {"detail":{}});
el.dispatchEvent(event); // Alerts "boo!"

...or... (though setUserData is deprecated, it is still working):

var userInput = '<a href="http://example.com" onclick="alert(\'boo!\')">Link<\/a>';
var span = document.createElement('span');
span.innerHTML = userInput;
span.setUserData('key', 10, {handle: function (n1, n2, n3, src) {
    src.getElementsByTagName('a')[0].click();
}});
var div = document.createElement('div');
div.appendChild(span);
span.cloneNode(); // Alerts "boo!"
var imprt = document.importNode(span, true); // Alerts "boo!"
var adopt = document.adoptNode(span); // Alerts "boo!" (adoptNode takes only the node)

...or during iteration...

var userInput = '<a href="http://example.com" onclick="alert(\'Boo!\');">Link</a>';
var span = document.createElement('span');
span.innerHTML = userInput;
var treeWalker = document.createTreeWalker(
  span,
  NodeFilter.SHOW_ELEMENT,
  { acceptNode: function(node) { node.click(); return NodeFilter.FILTER_ACCEPT; } },
  false
);
var nodeList = [];
while(treeWalker.nextNode()) nodeList.push(treeWalker.currentNode); // Alerts 'Boo!'

But without these kinds of (unusual) event interactions, building the DOM tree alone would not, as far as I have been able to detect, cause any side effects. (Of course, the examples above are contrived, and one wouldn't expect to encounter them very often, if at all!)

I just mean that there could be some pitfalls if you are executing events which may interact with user content (at least if you are not careful filtering). I've updated the answer to give some examples...Admittedly, these cases would be quite unusual, but I just wanted to challenge the notion that "what happens in the DOM stays in the DOM". — Brett Zamir, Feb 13 '14 at 04:14
If it's not put into the DOM memory at all, then of course there won't be a problem if you filter it while still a string or if you are merely traversing (and your traverser itself isn't performing interactions which could be harmful). I just mean something doesn't need to be appended to the document to cause bad interactions. — Brett Zamir, Feb 13 '14 at 04:16

score 1 · Accepted Answer · edited May 23 '17 at 11:57

No script embedded in the HTML can execute until it is put in the document. Try running this code on any page:

var html = "<script>document.body.innerHTML = '';</script>";
var div = document.createElement('div');
div.innerHTML = html;

You will notice that nothing changes. If the "malicious" script in the HTML had run, the document would have vanished. So you can use the DOM to sanitize HTML without worrying about bad JS in the HTML, as long as your sanitizer snips out the script, of course.
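As a rough illustration of the whitelist traversal being discussed, here is a minimal sketch. The `ALLOWED` table and the function names are hypothetical, chosen only for this example. It parses with `DOMParser` rather than assigning `innerHTML` on an element of the live document, because some content (for instance an `<img>` with an `onerror` attribute) can start loading and fire its handler even while detached, whereas a document produced by `DOMParser` should not run scripts or event handlers. It is not hardened against things like DOM clobbering, so a maintained library remains the safer choice.

```javascript
// Hypothetical allowlist: tag name -> attribute name -> pattern the value must match.
var ALLOWED = {
    a: { href: /^https?:\/\//i },
    b: {},
    i: {},
    p: {}
};

function hasOwn(obj, key) {
    return Object.prototype.hasOwnProperty.call(obj, key);
}

// Policy checks are kept free of DOM calls so they can be tested on their own.
function isAllowedTag(tagName) {
    return hasOwn(ALLOWED, tagName.toLowerCase());
}

function isAllowedAttribute(tagName, attrName, attrValue) {
    var tag = tagName.toLowerCase();
    var name = attrName.toLowerCase();
    if (!hasOwn(ALLOWED, tag) || !hasOwn(ALLOWED[tag], name)) {
        return false;
    }
    return ALLOWED[tag][name].test(attrValue);
}

// Browser-only part: parse into a separate document, then prune it.
function sanitize(html) {
    var doc = new DOMParser().parseFromString(html, 'text/html');
    var elements = doc.body.querySelectorAll('*'); // static list, safe to mutate the tree
    for (var i = 0; i < elements.length; i++) {
        var el = elements[i];
        if (!isAllowedTag(el.tagName)) {
            el.remove(); // drops the element together with its children
            continue;
        }
        // Copy first: removing attributes changes el.attributes while we loop.
        var attrs = Array.prototype.slice.call(el.attributes);
        for (var j = 0; j < attrs.length; j++) {
            if (!isAllowedAttribute(el.tagName, attrs[j].name, attrs[j].value)) {
                el.removeAttribute(attrs[j].name);
            }
        }
    }
    return doc.body.innerHTML;
}
```

In a browser, calling `sanitize` on the `<a href="http://example.com" onclick="alert('boo!')">Link</a>` string from the other answer should keep the `href` and drop the `onclick` attribute. Disallowed elements are removed together with their contents here; unwrapping them instead is a design choice left open.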

By the way, your approach is pretty safe and smarter than what most people try (parsing it with regex, the poor fools). However, it's best to rely on a good, trusted HTML sanitizing library for this, like HTML Purifier. Or, if you want to do it client-side, you can use ESAPI-JS (recommended by @Brett Zamir).
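To make the regex remark concrete, here is a small sketch of how a pattern-based filter can miss markup that a browser still accepts. The `stripScriptsNaively` function is hypothetical and not taken from any particular library. Browsers allow whitespace, including line breaks, between a tag name and the closing `>`, while `.` in a JavaScript regular expression does not match line breaks unless the `s` flag is set.

```javascript
// Hypothetical naive filter, written only to show the pitfall.
function stripScriptsNaively(html) {
    // "." does not match line breaks here, so a tag split across lines slips through.
    return html.replace(/<script.*?>.*?<\/script>/gi, '');
}

var plain = '<script>alert(1)</script>';
var tricky = '<script\n>alert(1)</script\n>';

console.log(JSON.stringify(stripScriptsNaively(plain))); // prints ""
console.log(stripScriptsNaively(tricky) === tricky);     // prints true: nothing was removed
```

Patching the pattern for this one case would still leave other variations, which is why walking the tree the browser actually built tends to be more reliable than matching the source text.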

answered Feb 13 '14 at 00:45

rvighne

Thanks. I guess I am looking for confirmation that there are absolutely no pitfalls to this approach. For example if the string contained 'eval', or some css... etc. – Piwakawaka Feb 13 '14 at 01:02
This solution is still vulnerable to injection attacks. The original text could end the string with a quote and a semicolon, then add its own code that would execute directly after the "var html =" line. – Ted Bigham Aug 18 '17 at 22:58
@TedBigham: You misunderstood the example. The quoted string in the variable `html` is just for illustration, it would actually be some data obtained from an untrusted source (e.g. an AJAX request). The risk that we are trying to mitigate comes from a malicious script being injected into the document, not into your script (which is already running). – rvighne Aug 19 '17 at 00:23

score -1 · Answer 3 · answered Jan 22 '19 at 17:25

You can use a "sandboxed" iframe that won't execute anything.

var iframe = document.createElement('iframe');
iframe['sandbox'] = 'allow-same-origin';

From w3schools:

The sandbox attribute enables an extra set of restrictions for the content in the iframe. When the sandbox attribute is present, it will:

block form submission

block script execution

disable APIs

...

P.S. That's, by the way, exactly how we do it in our Html Sanitizer https://github.com/jitbit/HtmlSanitizer - we use the browser to interpret HTML and convert it to DOM. Feel free to check the code (or actually use the component)

(disclaimer: I'm a contributor to that OSS project)
