9

What's the best library/approach for removing Javascript from HTML that will be displayed?

For example, take:

<html><body><span onmousemove='doBadXss()'>test</span></body></html>

and leave:

<html><body><span>test</span></body></html>

I see the DeXSS project. But is that the best way to go?

mtyson
  • Probably, the easiest way to do it is to use XSLT (write a stylesheet that copies the allowable elements and attributes), but that only works if your document is XHTML (unless XSLT has an HTML mode---I can't remember if there's one). – C. K. Young Nov 11 '10 at 16:38
  • 2
    That you wrote "IE" instead of "i.e." confused me to no end! – JasonFruit Nov 11 '10 at 16:45
  • @JasonFruit: lolz! i too got confused. – Rakesh Juyal Nov 11 '10 at 16:47
  • 2
    possible duplicate of [How to "Purify" HTML code to prevent XSS attacks in Java or JSP ?](http://stackoverflow.com/questions/3587199/how-to-purify-html-code-to-prevent-xss-attacks-in-java-or-jsp) – BalusC Nov 11 '10 at 17:01

3 Answers

11

JSoup has a simple method for sanitizing HTML based on a whitelist. Check http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer

It uses a whitelist, which is safer than the blacklist approach DeXSS uses. From the DeXSS page:

There are still a number of known XSS attacks that DeXSS does not yet detect.

A blacklist only disallows known unsafe constructions, while a whitelist only allows known safe constructions. So unknown, possibly unsafe constructions are blocked only by a whitelist.
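To make the difference concrete, here is a minimal sketch using only the JDK. The class name, method names, and both attribute lists are made up for illustration; they are not taken from jsoup or DeXSS and are not a vetted security policy. It checks the `onmousemove` attribute from the question against each kind of list.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: names and attribute lists are illustrative assumptions.
class AttributeFilters {
    // Blacklist: reject only the attribute names someone thought of in advance.
    static final Set<String> BLOCKED = Set.of("onclick", "onmouseover", "onload");

    // Whitelist: accept only the attribute names that are explicitly trusted.
    static final Set<String> ALLOWED = Set.of("title", "lang", "dir");

    static boolean passesBlacklist(String attr) {
        return !BLOCKED.contains(attr.toLowerCase(Locale.ROOT));
    }

    static boolean passesWhitelist(String attr) {
        return ALLOWED.contains(attr.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String attr = "onmousemove"; // from the question; nobody added it to BLOCKED
        System.out.println("blacklist lets it through: " + passesBlacklist(attr));
        System.out.println("whitelist lets it through: " + passesWhitelist(attr));
    }
}
```

The blacklist lets `onmousemove` through because it was never listed, while the whitelist rejects it without having to know it exists. A real sanitizer also has to deal with elements, URL schemes, and parsing quirks, which is why a maintained library is the better choice.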

beetstra
1

The easiest way would be to not have those in the first place... It probably would make sense to allow only very simple tags to be used in free-form fields and to disallow any kind of attributes.

Probably not the answer you're going for, but in many cases you only want to provide markup capabilities, not a full editing suite.


Similarly, an even easier approach is to provide a text-based syntax, such as Markdown, for editing. The SO edit area, for instance, leaves few ways to exploit it: Markdown syntax plus a limited list of tags without attributes.

haylem
1

You could try dom4j (http://dom4j.sourceforge.net/dom4j-1.6.1/). It is a DOM parser (as opposed to SAX) that lets you easily traverse and manipulate the DOM, removing node attributes such as onmouseover (or entire elements such as <script>) before writing the document back out or streaming it somewhere. Depending on how wild your HTML is, you may need to clean it up first; JTidy (http://jtidy.sourceforge.net/) is good for that.

Obviously, all of this adds some overhead if you do it at page render time.
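A rough sketch of that traversal follows. It uses the JDK's built-in `org.w3c.dom` API rather than dom4j, so that it runs without extra dependencies. The class and method names are hypothetical, and it assumes the input is already well-formed XHTML (for example after a tidy pass) with no DOCTYPE, since DOCTYPE declarations are rejected here as a precaution against external-entity tricks.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; JDK DOM is swapped in for dom4j here.
class XhtmlScrubber {
    static String scrub(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: input carries no DOCTYPE; reject it rather than resolve entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            scrubElement(doc.getDocumentElement());

            // Serialize as plain XML without a declaration or added indentation.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.INDENT, "no");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse or serialize input", e);
        }
    }

    private static void scrubElement(Element el) {
        // Collect names first: the attribute map is live, so removing while iterating skips entries.
        NamedNodeMap attrs = el.getAttributes();
        List<String> handlers = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            String name = attrs.item(i).getNodeName();
            if (name.toLowerCase(Locale.ROOT).startsWith("on")) {
                handlers.add(name);
            }
        }
        for (String name : handlers) {
            el.removeAttribute(name);
        }

        // Snapshot the children for the same reason, then drop <script> and recurse into the rest.
        NodeList kids = el.getChildNodes();
        List<Node> snapshot = new ArrayList<>();
        for (int i = 0; i < kids.getLength(); i++) {
            snapshot.add(kids.item(i));
        }
        for (Node kid : snapshot) {
            if (!(kid instanceof Element)) {
                continue;
            }
            if ("script".equalsIgnoreCase(kid.getNodeName())) {
                el.removeChild(kid);
            } else {
                scrubElement((Element) kid);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(scrub(
                "<html><body><span onmousemove='doBadXss()'>test</span></body></html>"));
    }
}
```

Note that this removes only `on*` attributes and `<script>` elements, so it is itself a blacklist: it would not catch things like `javascript:` URLs in `href` attributes. Treat it as an illustration of the DOM-walking mechanics, not as a complete sanitizer.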

Richard H