3

I download an HTML tree from an untrusted source and display it as a child of a div in my page. The danger is that the downloaded markup may run scripts, either in script elements or in event handlers. Just as HTML has a tag for defining scripts, is there a tag such as

<noscriptex>
    <script>
        ...
    </script>
</noscriptex>

so that the browser would not execute any code inside it?

If there is no such thing, how do I clean up the downloaded HTML so that it displays only DOM elements and their CSS, with no scripting involved?

foobarometer

3 Answers

1

No; there is no such feature.

Instead, you need to parse the HTML and remove any unrecognized tags and attributes using a strict whitelist.

You also need to validate attribute values; especially URLs.
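
To illustrate the URL validation step, here is a minimal sketch that accepts only URLs whose scheme is on an explicit allowlist. The function name `isSafeUrl`, the list of schemes, and the placeholder base URL are assumptions made for the example; adjust them to what your page actually needs.

```javascript
// Hypothetical allowlist of URL schemes; anything else (javascript:, data:, ...) is rejected.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:', 'mailto:']);

// Returns true only if `value` parses as a URL with an allowed scheme.
// Relative URLs are resolved against `base` (a placeholder here).
function isSafeUrl(value, base = 'https://example.invalid/') {
  let url;
  try {
    // The URL parser normalises case and surrounding whitespace,
    // so " JavaScript:..." is still recognised as javascript:.
    url = new URL(value, base);
  } catch (e) {
    return false; // unparseable input is treated as unsafe
  }
  return ALLOWED_PROTOCOLS.has(url.protocol);
}
```

An attribute such as href or src would be kept only when `isSafeUrl` returns true for its value, and dropped otherwise.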

SLaks
  • No need for parsing, the DOM can be used to find and remove script elements without executing them. Attribute values are tougher, perhaps all handlers (on*) can be set to undefined? – RobG Sep 19 '12 at 05:04
  • @RobG Thank you, this is a better suggestion, but it is pitiful that code and data live together but there is no execution protection on data like in file systems or virtual memory code vs data pages. I am not sure if all handlers can be set to undefined even before at least some of the onLoad a la onCreate kind of ones are called. – foobarometer Sep 19 '12 at 07:07
  • @SLaks it is expensive for me to write a parser, but I am considering it. – foobarometer Sep 19 '12 at 07:09
  • I googled for these before asking the question, but it turns out there are tools already out there. See [related](http://stackoverflow.com/questions/295566/sanitize-rewrite-html-on-the-client-side) and lots more on [stackoverflow](http://stackoverflow.com) – foobarometer Sep 19 '12 at 07:27
1

You can use a function to remove scripts from markup, e.g.

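function stripScripts(markup) {

    // Parse the markup in a div that is not attached to the page
    var div = document.createElement('div');
    var frag = document.createDocumentFragment();

    div.innerHTML = markup;

    // getElementsByTagName returns a live collection, so remove from the end
    var scripts = div.getElementsByTagName('script');
    var i = scripts.length;

    while (i--) {
        scripts[i].parentNode.removeChild(scripts[i]);
    }

    // Move the remaining nodes into a fragment that can be inserted in one step
    while (div.firstChild) {
        frag.appendChild(div.firstChild);
    }
    return frag;
}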

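Script elements inserted using innerHTML are not executed, and the function removes them before the content is added to the page. This covers script elements only: inline event handler attributes (on*) and javascript: URLs in the markup are left in place.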

Note that the object returned by createDocumentFragment can be inserted directly into the DOM, and the fragment returned by the function has no script elements.
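
For the on* handler attributes raised in the comments on the first answer, a similar pass over the fragment could strip them. This is a minimal sketch, not a complete sanitizer; the name `stripHandlerAttributes` is made up for the example, and the name test is deliberately blunt (it removes any attribute whose name starts with "on").

```javascript
// Hypothetical companion to stripScripts: remove every inline event handler
// attribute (onclick, onerror, onload, ...) from the elements under `root`.
// `root` can be an element or a DocumentFragment.
function stripHandlerAttributes(root) {
  var elements = root.querySelectorAll('*');
  for (var i = 0; i < elements.length; i++) {
    var el = elements[i];
    // Copy the attribute list first: it is live and changes as attributes are removed.
    var attrs = Array.from(el.attributes);
    for (var j = 0; j < attrs.length; j++) {
      if (/^on/i.test(attrs[j].name)) {
        el.removeAttribute(attrs[j].name);
      }
    }
  }
  return root;
}

// Possible use, where `container` is an assumed element in your page:
// container.appendChild(stripHandlerAttributes(stripScripts(markup)));
```

URL-valued attributes such as href and src would still need their own check.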

RobG
  • Keep in mind that there are other ways to get scripts in content besides `<script>` tags. – jfriend00 Sep 19 '12 at 06:14
  • Yes, of course. But if the OP is going to insert markup from other sites in the page, anything can happen. – RobG Sep 20 '12 at 20:44
0

This is what an iframe is for. If the content comes from a different domain than the host page, the browser's same-origin policy keeps its scripts from reading or modifying the rest of your page. You can let it run scripts to its heart's content and they can't affect your part of the page.
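
If the markup is already in hand rather than hosted on another domain, one possible variant is to put it in an iframe with the HTML5 sandbox attribute, which also stops scripts in the frame from running. The sketch below assumes the target browsers support both sandbox and srcdoc; the helper names are made up for the example.

```javascript
// Escape a string for use inside a double-quoted HTML attribute value.
function escapeAttribute(text) {
  return String(text).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
}

// Hypothetical helper: wrap untrusted markup in an iframe whose empty
// `sandbox` attribute applies all restrictions, including no script execution.
function sandboxedFrame(untrustedHtml) {
  return '<iframe sandbox="" srcdoc="' + escapeAttribute(untrustedHtml) + '"></iframe>';
}
```

The returned string is the markup for the frame itself; the untrusted content only ever appears inside the escaped srcdoc value.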

jfriend00
  • The script can still do bad stuff. But I guess this is the most I can get! – foobarometer Sep 19 '12 at 07:06
  • @foobarometer - what kind of bad stuff are you trying to protect against? – jfriend00 Sep 19 '12 at 07:31
  • I don't want the user to think my site is bad, even if it launches alert() or worse things like detecting Windows for example, and creating ActiveXObjects. – foobarometer Sep 19 '12 at 07:44
  • @foobarometer - the only way to assure that is to only accept highly filtered content (like only text or only text with a few formatting codes), not generic HTML. There's a reason that bulletin boards use their own formatting codes and they often don't accept HTML. – jfriend00 Sep 19 '12 at 07:49
  • This means I would have to invent a markup of my own for what I need just like the bulletin board guys, but what I need is simple: HTML's presentation layer. This according to me is very broken. The right place to do it would be the browser with some execution prevention. Nevertheless, to solve the problem at hand, I am leaning towards a custom markup that can be converted to HTML for my purposes. – foobarometer Sep 19 '12 at 08:07
  • @foobarometer - You could actually still use HTML as your markup language, but where your own code parses the limited set of tags that you want to accept and do not let the browser parse anything. If you do that, you can let through only what you want to let through. You will have to not take anything (like tag attributes) that you haven't explicitly parsed yourself and decided to allow. This might be better than inventing your own markup. Even things like image URLs or link URLs will have to be sanitized though as they could contain javascript. – jfriend00 Sep 19 '12 at 16:25
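
The approach in the last comment (keep HTML as the markup, but let through only what your own code has recognised) can be sketched as follows. This is an illustration under stated assumptions, not a full parser: the tag list is hypothetical, only tags without attributes are accepted, and tag nesting is not checked.

```javascript
// Hypothetical allowlist of attribute-free formatting tags.
const ALLOWED_TAGS = ['b', 'i', 'em', 'strong', 'p', 'br', 'ul', 'ol', 'li'];

// Escape every character that has special meaning in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

function renderLimitedHtml(input) {
  // Step 1: neutralise everything, so nothing in the input can open a tag.
  const escaped = escapeHtml(input);
  // Step 2: re-enable only bare opening/closing tags from the allowlist.
  // A tag carrying any attribute does not match and stays escaped.
  const pattern = new RegExp('&lt;(/?)(' + ALLOWED_TAGS.join('|') + ')&gt;', 'gi');
  return escaped.replace(pattern, (match, slash, tag) => '<' + slash + tag.toLowerCase() + '>');
}
```

Because the default is to escape, anything the code does not explicitly recognise is displayed as text rather than interpreted by the browser. Links and images would need attribute support plus URL validation on top of this.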