14

I have written some code that takes a string of html and cleans away any ugly HTML from it using jQuery (see an early prototype in this SO question). It works pretty well, but I stumbled on an issue:

When using .append() to wrap the html in a div, all script elements in the code are evaluated and run (see this SO answer for an explanation why this happens). I don't want this, I really just want them to be removed, but I can handle that later myself as long as they are not run.

I am using this code:

var wrapper = $('<div/>').append($(html));

I tried to do it this way instead:

var wrapper = $('<div>' + html + '</div>');

But that just brings forth the "Access denied" error in IE that the append() function fixes (see the answer I referenced above).

I think I might be able to rewrite my code to not require a wrapper around the html, but I am not sure, and I'd like to know if it is possible to append html without running scripts in it, anyway.

My questions:

  • How do I wrap a piece of unknown html without running scripts inside it, preferably removing them altogether?

  • Should I throw jQuery out the window and do this with plain JavaScript and DOM manipulation instead? Would that help?

What I am not trying to do:

I am not trying to put some kind of security layer on the client side. I am very much aware that it would be pointless.

Update: James' suggestion

James suggested that I should filter out the script elements, but look at these two examples (the original first and the James' suggestion):

jQuery("<p/>").append("<br/>hello<script type='text/javascript'>console.log('gnu!'); </script>there")

keeps the text nodes but writes gnu!

jQuery("<p/>").append(jQuery("<br/>hello<script type='text/javascript'>console.log('gnu!'); </script>there").not('script'))`

Doesn't write gnu!, but also loses the text nodes.

Update 2:

James has updated his answer and I have accepted it. See my latest comment to his answer, though.

Community
  • 1
  • 1
Peter Jaric
  • 5,162
  • 3
  • 30
  • 42

3 Answers3

10

How about removing the scripts first?

var wrapper = $('<div/>').append($(html).not('script'));

  • Create the div container
  • Use plain JS to put html into div
  • Remove all script elements in the div

Assuming script elements in the html are not nested in other elements:

var wrapper = document.createElement('div');
wrapper.innerHTML = html;
$(wrapper).children().remove('script');

var wrapper = document.createElement('div');
wrapper.innerHTML = html;
$(wrapper).find('script').remove();

This works for the case where html is just text and where html has text outside any elements.

James An
  • 1,383
  • 1
  • 10
  • 16
  • Good idea, and it almost works, but not quite. The string that I, somewhat carelessly, call html, might contain text outside of tags, and I want that too. See my updated answer for examples. – Peter Jaric Apr 05 '11 at 11:04
  • Additionally, the "html" might be just text, with no tags, and then jQuery will treat it as a selector if it is sent to $()/jQuery(), according to the docs: http://api.jquery.com/jQuery/#jQuery2 – Peter Jaric Apr 05 '11 at 11:19
  • And the `not()` only works if script is not wrapped inside another element. – bpierre Apr 05 '11 at 11:21
  • This is promising! With a little modification it seems to work for nested scripts too: `jQuery(wrapper).find('script').remove()`. It works in Chrome, now I'll try it in IE (*cringe*) – Peter Jaric Apr 05 '11 at 11:26
  • The proof of concept works in IE too. Now I'll add it to my real code. – Peter Jaric Apr 05 '11 at 11:33
  • Hah, turns out this solution does not solve my problem all the way, although it is correct (and I will of course mark it as accepted). The rich text area I get the html-content from suffers from the same problem. So prior to my code, when the user pastes into the rich text area, the scripts are run. I am using HTMLBox, but tinyMCE and other editors I have tried today also have this problem. I tested it on their demo pages. – Peter Jaric Apr 06 '11 at 09:31
  • Oh, =( sorry to hear about it not working out after all. I don't have much experience working with web WYSIWYG editors. Thanks for the accept anyway! – James An Apr 06 '11 at 23:54
0

You should remove the script elements:

var wrapper = $('<div/>').append($(html).remove("script"));

Second attempt:

node-validator can be used in the browser: https://github.com/chriso/node-validator

var str = sanitize(large_input_str).xss();

Alternatively, PHPJS has a strip_tags function (regex/evil based): http://phpjs.org/functions/strip_tags:535

bpierre
  • 10,957
  • 2
  • 26
  • 27
  • Try this code: `jQuery("
    ").append(jQuery("
    hellothere").remove('script'))`. It still runs the script tag, actually, and removes just one of the text nodes... Interesting...
    – Peter Jaric Apr 05 '11 at 11:11
  • Yeah, the script is executed on `$(html)`… It seems the solution would be to remove the script elements with a regex? – bpierre Apr 05 '11 at 11:20
  • But that is evil: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 :) – Peter Jaric Apr 05 '11 at 11:23
  • Yes, you need a real parser, something BIG like json2.js but for removing script tags :-) – bpierre Apr 05 '11 at 11:27
0

The scripts in the html kept executing for me with all the simple methods mentioned here, then I remembered jquery has a tool for this (since 1.8), jQuery.parseHTML. There's still a catch, according to the documentation events inside attributes(i.e. <img onerror>) will still run.

This is what I'm using:

var $dom = $($.parseHTML(d));

$dom will be a jquery object with the elements found

Dan
  • 1,112
  • 9
  • 15