8

I have a rich text editor that passes HTML to the server. That HTML is then displayed to other users. I want to make sure there is no JavaScript in that HTML. Is there any way to do this?

Also, I'm using ASP.NET if that helps.

Noldorin
  • Do you need to keep the attributes or not? – Nordes May 13 '09 at 16:01
  • Yes, I'm using a Rich Text Editor called Cute Editor; it handles certain things like removing –  May 14 '09 at 14:53
  • So to actually answer your question, yes I need to keep attributes to have the full use of the RTE –  May 14 '09 at 14:54

6 Answers

11

The only way to ensure that some HTML markup contains no JavaScript is to strip it of all unsafe HTML tags and attributes, in order to prevent Cross-Site Scripting (XSS).

However, there is in general no reliable way of explicitly removing all unsafe elements and attributes by name, since certain browsers may interpret tags or attributes you weren't even aware of at the time of design, and thus open up a security hole for malicious users. This is why you're much better off taking a whitelisting approach rather than a blacklisting one. That is to say, only allow HTML tags that you are sure are safe, and strip all others by default. Indeed, a single accidentally permitted tag can make your website vulnerable to XSS.


Whitelisting (good approach)

See this article on HTML sanitisation, which offers some specific examples of why you should whitelist rather than blacklist. Quote from that page:

Here is an incomplete list of potentially dangerous HTML tags and attributes:

  • script, which can contain malicious script
  • applet, embed, and object, which can automatically download and execute malicious code
  • meta, which can contain malicious redirects
  • onload, onunload, and all other on* attributes, which can contain malicious script
  • style, link, and the style attribute, which can contain malicious script

Here is another helpful page that suggests a set of HTML tags & attributes, as well as CSS properties, that are typically safe to allow, along with recommended practices.

Blacklisting (generally bad approach)

Although many websites have used (and still use) the blacklisting approach, there is almost never any true need for it. (The security risks invariably outweigh the limitations that whitelisting imposes on the formatting capabilities granted to the user.) If you do use it, you need to be very aware of its flaws.

For example, this page gives a list of what are supposedly "all" the HTML tags you might want to strip out. Just from observing it briefly, you should notice that it contains a very limited number of element names; a browser could easily include a proprietary tag that unwittingly allowed scripts to run on your page, which is essentially the main problem with blacklisting.


Finally, I would strongly recommend that you utilise an HTML DOM library for .NET (such as the well-known HTML Agility Pack), as opposed to regular expressions, to perform the cleaning/whitelisting, since it will be significantly more reliable. (It is quite possible to create some pretty crazy obfuscated HTML that can fool regexes! A proper HTML reader/writer makes the coding of the system much easier, anyway.)
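
To make that concrete, here is a minimal sketch of what such a whitelisting pass might look like with the HTML Agility Pack. The allowed tag and attribute sets are purely illustrative (choose your own deliberately), and the class and method names are my own, not part of the library:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using HtmlAgilityPack;

    public static class HtmlWhitelister
    {
        // Illustrative whitelist only -- decide on your own sets carefully.
        private static readonly HashSet<string> AllowedTags = new HashSet<string>(
            StringComparer.OrdinalIgnoreCase) { "p", "b", "i", "em", "strong", "ul", "ol", "li", "br", "a" };

        private static readonly HashSet<string> AllowedAttributes = new HashSet<string>(
            StringComparer.OrdinalIgnoreCase) { "href", "title" };

        public static string Sanitise(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Take a snapshot of the nodes, since we mutate the tree as we go.
            foreach (var node in doc.DocumentNode.Descendants().ToList())
            {
                if (node.NodeType != HtmlNodeType.Element)
                    continue;

                if (!AllowedTags.Contains(node.Name))
                {
                    // Not whitelisted: drop the element and everything inside it.
                    node.Remove();
                    continue;
                }

                // Strip every attribute that isn't explicitly whitelisted.
                foreach (var attr in node.Attributes.ToList())
                {
                    if (!AllowedAttributes.Contains(attr.Name))
                        attr.Remove();
                }
            }

            return doc.DocumentNode.InnerHtml;
        }
    }

Even a whitelisted attribute such as href still needs its value checked (for instance to reject javascript: URLs); see the URL scheme filtering mentioned in another answer below.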

Hopefully that should give you a decent overview of what you need to design in order to fully (or at least maximally) prevent XSS, and why it's critical that HTML sanitisation is performed with the unknown factor in mind.

Matthew Strawbridge
Noldorin
  • While writing my answer I saw yours and it looks good. I actually had to code something in C# to do what you may be trying to do: prevent any XSS attack. I've made a config file to specify which HTML tags, with which attributes, are allowed. But you will need a lot of tests for your code (like what Noldorin was saying). – Nordes May 13 '09 at 16:06
  • 2
    Blacklisting can never work, as other browsers might interpret tags you didn't even know. You need a whitelisting approach. – sleske May 13 '09 at 16:07
  • On my side I'm more whitelisting than blacklisting. For the style attribute you need to remove behavior, etc. – Nordes May 13 '09 at 16:10
  • @sleske: Blacklisting does work in practice, but I agree that it can be risky. Equally, if you whitelist certain tags, then there may be some harmless ones that the user might want to use that aren't allowed. Still, this is admittedly a lesser evil. I'll update the post to mention whitelisting, which is important. Fancy removing the down vote? – Noldorin May 13 '09 at 16:10
  • 1
    @Noldorin: Blacklisting does work in the sense that it makes attacks harder, but it will always leave holes; that's what I meant. Anyway, now I actually like your answer :-). +1 – sleske May 13 '09 at 16:25
  • @sleske: Yeah, exactly. The point is that only one accidentally allowed tag can ruin security. I've put in a lot of clarifications now, which should all be correct. Thanks for pointing this out! (I was aware of it, but it slipped my mind when I first wrote the post.) – Noldorin May 13 '09 at 16:28
  • Unfortunately I can't give assisted answers on stackoverflow, this is a really great answer but AntiSamy is what I was looking for. Oddly enough it uses the HTML Agility Pack –  May 14 '09 at 15:09
4

As pointed out by Lee Theobald, that's a very dangerous plan. You cannot by definition ever produce "safe" HTML by filtering/blacklisting, since the user might put stuff into the HTML that you didn't think about (or that doesn't even exist in your browser version, but does in others).

The only safe way is a whitelisting approach, i.e. strip everything but plain text and certain specific HTML constructs. This, incidentally, is what stackoverflow.com does :-).

sleske
3

Here is how I do it, using a white-listing approach (JavaScript and Python code):

https://github.com/dcollien/FilterHTML

I define a specification for a subset of allowed HTML, and only that should get through the filter. There are also options to purify URL attributes, by only allowing certain schemes (like http:, ftp:, etc.) and disallowing those that would cause XSS/JavaScript problems (like javascript:, or even data:).

Edit: This isn't going to give you 100% safety out of the box for all situations, but used intelligently, and in conjunction with a few other tricks (like checking whether URLs are on the same domain and have the correct content type, etc.), it could be what you need.
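
Language aside, the scheme check itself is simple. Here is a rough C# sketch of the same idea (the helper name and allowed-scheme list are my own illustration, not part of FilterHTML; note that it rejects relative URLs, which you may or may not want):

    using System;
    using System.Collections.Generic;

    static class UrlSchemeFilter
    {
        private static readonly HashSet<string> AllowedSchemes =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "http", "https", "ftp", "mailto" };

        public static bool IsAllowed(string url)
        {
            // Only absolute URIs with a whitelisted scheme pass; anything that fails
            // to parse, or that uses javascript:, data:, vbscript:, etc., is rejected.
            return Uri.TryCreate(url, UriKind.Absolute, out var parsed)
                && AllowedSchemes.Contains(parsed.Scheme);
        }
    }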

2

If you want users to see the HTML code itself, do a string replace of '&', '<' and '>' (replacing '&' first so you don't double-encode). For example, '<' becomes '&lt;'.
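
In .NET you don't have to do those replacements by hand; HttpUtility.HtmlEncode (in System.Web) covers them. A tiny sketch, with an example input of my own:

    using System.Web;

    string untrusted = "<script>alert('xss')</script>";

    // Encodes &, <, > and double quotes so the markup is shown as text
    // rather than interpreted by the browser.
    string safeForDisplay = HttpUtility.HtmlEncode(untrusted);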

If you want the HTML to work, the easiest way is to remove all HTML and JavaScript and then add back only the HTML. Unfortunately, there is almost no sure way of removing all JavaScript while allowing only HTML.

For example you may want to allow images. However you may not know that you can do

<img src='x' onerror='runEvilScript()'>

and the browser will run the script in the onerror attribute as soon as the bogus image fails to load. It becomes very unsafe very fast. This is why most websites, like Wikipedia and this one, use a special markup language (such as Markdown) instead of raw HTML. This makes it much easier to allow formatting but not malicious JavaScript.

trydyingtolive
-1

You may want to check how some browser-based WYSIWYG editors such as TinyMCE do it. They usually remove JS and seem to do a reasonable job of it.

Darryl Hein
  • 1
    Yeah they do that, but if you're a bit of a "hacker" you can put the TinyMCE editor in text mode, and then when the data is saved there's still a chance the user has modified the text to include JavaScript. – Nordes May 13 '09 at 16:04
  • Well, this is true for any JS. You can always disable JS and submit whatever you want. You should instead be looking at what you can do with ASP.NET, as you'll want to protect yourself on the server, where you have control, vs the browser, where you have very little. – Darryl Hein May 13 '09 at 16:08
-2

The simplest thing to do would be to strip out tags with a regex. Trouble is that you can do plenty of nasty things without script tags (e.g. embed dodgy images, or have links to other sites that contain nasty JavaScript). Disabling HTML completely, by converting the less-than/greater-than characters into their HTML entity forms (e.g. &lt;), could also be an option.

If you want a more powerful solution, in the past I have used AntiSamy to sanitize incoming text so that it's safe for viewing.

Lee Theobald
  • 4
    Actually, "strip out tags with a regex" is not the best of recommendations to give. – Tomalak May 13 '09 at 16:03
  • I'm not familiar with AntiSamy, but I would recommend that you ensure it's well designed before using it (i.e. takes a whitelisting approach, for a start). Also, regex is *definitely* not the way to go, even for a simple solution. – Noldorin May 14 '09 at 15:52
  • it is a whitelisting approach –  Sep 01 '09 at 15:06
  • "it is a whitelisting approach" NO! Stripping tags is blacklisting. Are you 100% sure you will remove all tags? What if the attacker leaves the closing angle backet away? Does your regex catch this? => you have failed to blacklist. – usr May 26 '11 at 18:40
  • -1 Dangerously wrong advice. Also, about the "with a regex part": Obligatory link to canonical question about HTML & regex: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – sleske Nov 19 '12 at 09:23