filter out encoded javascript content from request

Question

I have a problem where I am trying to cleanse the request content to strip out HTML and javascript if included in the input parameters.

This is basically to protect against XSS attacks. The ideal mechanism would be to validate input and encode the output, but due to some restrictions I cannot work on the output end.

All I can do at this time is to try to cleanse the input through a filter. I am using ESAPI to canonicalize the input parameters and also using jsoup with the most restrictive Whitelist.none() option to strip all HTML.

This works as long as the malicious JavaScript is inside some HTML tags, but it fails for a URL that carries JavaScript code without any HTML surrounding it, e.g.:

http://example.com/index.html?a=40&b=10&c='-prompt``-'

ends up showing an alert on the page. This is kind of what I am doing right now:

param = encoder.canonicalize(param, false, false);
param = Jsoup.clean(param, Whitelist.none());

So the question is:

Is there some way through which I can make sure that my input is stripped of all HTML and javascript code at the filter?
Should I throw in some regex validations but is there any regex that will take care of the cases that are getting past the check I have right now?

score 3 · Accepted Answer · edited May 23 '17 at 10:28

DISCLAIMER:

If output-escaping is not allowed in your internet-facing solution, you are in a NO-WIN SCENARIO. It's like antivirus on Windows: You'll be able to detect specific and known attacks, but you will be unable to detect or defend against unknown attacks. If your employer insists on this path, your due diligence is to make management aware of this fact and get their acceptance of the risks in writing. Every time I've confronted management with this, they've opted for the correct solution--output escaping.

================================================================

First off... watch out when using JSoup in any kind of a cleaning/filtering/input validation situation.

Upon receiving invalid HTML, like

<script>alert(1);

Jsoup will add in the missing </script> tag.

This means that if you're using Jsoup to "cleanse" HTML, it first transforms INVALID HTML into VALID HTML, before it begins processing.

"So the question is: Is there some way through which I can make sure that my input is stripped of all HTML and javascript code at the filter? Should I throw in some regex validations but is there any regex that will take care of the cases that are getting past the check I have right now?"

No. ESAPI's input validation is not appropriate for your use case, because HTML is not a regular language and ESAPI's validators are driven by regular expressions. The fact is you cannot do what you ask:

"Is there some way through which I can make sure that my input is stripped of all HTML and javascript code at the filter?"

And still have a functioning web application that requires user-defined HTML/JavaScript.
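
To illustrate why a pattern-based filter misses the payload from the question, here is a minimal sketch. The class name TagStripDemo and the tag-stripping regex are illustrative assumptions, not ESAPI's actual validation rules; the point is only that a filter looking for markup has nothing to match in a tag-free input.

```java
import java.util.regex.Pattern;

final class TagStripDemo {
    // Naive blacklist (hypothetical): removes anything that looks like an HTML tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String withTags = "<script>alert(1)</script>";
        String payload = "'-prompt``-'"; // the c parameter from the question

        // The tags are removed, leaving: alert(1)
        System.out.println(stripTags(withTags));
        // Printed unchanged: the payload contains no tags to strip.
        System.out.println(stripTags(payload));
    }
}
```

The second input survives intact, so whether it executes depends entirely on where and how the page writes it out, which is something an input filter cannot know.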

You can stack the deck in your favor a little bit: I would choose something like OWASP's HTML Sanitizer, and test your implementation against the XSS inputs listed here.

Many of those inputs are taken from OWASP's XSS Filter Evasion Cheat Sheet, and they will at least exercise your application against known attempts. But you will never be secure without output escaping.

===================UPDATE FROM COMMENTS==================

So the use case is to try to block all HTML and JavaScript. My recommendation is to use Caja, since it handles HTML, CSS, and JavaScript.

JavaScript, though, is also difficult to handle through input validation because, like HTML, it is not a regular language. Additionally, each browser has its own implementation that deviates in different ways from the ECMAScript spec. If you want to keep your input from being interpreted, you would ideally need a parser for each browser family that tries to interpret the user input in order to block it.

By contrast, all you really have to do is make sure that the output is escaped. Sorry to beat a dead horse, but I have to stress that output escaping is 100x more important than rejecting user input. You want both, but if forced to choose one or the other, output escaping is less work overall.
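
As a rough sketch of what output escaping means here: the class and method names below are hypothetical, and the idea that the c parameter is echoed into a single-quoted JavaScript string is an assumption inferred from the shape of the payload, not something the question states. In a real application a vetted encoder would be preferable to hand-rolled code, for example ESAPI's encodeForHTML and encodeForJavaScript, since ESAPI is already in use.

```java
final class OutputEscapingSketch {

    // Escape for HTML element content and quoted attribute values.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            switch (ch) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(ch);
            }
        }
        return out.toString();
    }

    // Escape for a quoted JavaScript string literal: every character that is
    // not an ASCII letter or digit becomes a four-hex-digit Unicode escape.
    static String escapeJsString(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            boolean alnum = (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z')
                    || (ch >= '0' && ch <= '9');
            if (alnum) {
                out.append(ch);
            } else {
                out.append(String.format("\\u%04x", (int) ch));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String c = "'-prompt``-'"; // the c parameter from the question

        // Assumed JavaScript context: the quotes and backticks can no longer
        // terminate the string literal.
        System.out.println("var c = '" + escapeJsString(c) + "';");
        // HTML context: the value is rendered as text, not markup.
        System.out.println("<td>" + escapeHtml(c) + "</td>");
    }
}
```

The encoder has to match the context the value is written into: HTML escaping alone would not be the right tool for a value placed inside a script block, which is why the two functions are kept separate.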

Thanks for the reply. I get your point. About this: "And still have a functioning web application that requires user-defined HTML/JavaScript." What if I don't want to allow users to pass in any HTML/JS as request input parameters? Is there some way to prevent that? I will go through the HTML Sanitizer that you have linked to. — Ash, Mar 29 '16 at 17:51
Start here: https://github.com/OWASP/java-html-sanitizer/blob/master/docs/getting_started.md Basically, what it sounds like you want is to define a policy builder that is essentially empty, so it won't allow ANY HTML tags into the application. That said, just denying ALL HTML isn't going to stop XSS that attacks HTML attributes, and if you support IE you need to guard against VBScript AND JavaScript. — avgvstvs, Mar 29 '16 at 18:01
I don't know if the HTML Sanitizer will allow you to define attribute policies if you're already rejecting all HTML input. — avgvstvs, Mar 29 '16 at 18:02
I am trying to reject all HTML and JS input and have had some success using HTML sanitizers. The challenge is that these sanitizers rely on the input being HTML to be able to strip it. Certain inputs can also be encoded (for which I am using ESAPI's canonicalize to decode them back to the simplest form). But if an input does not contain any HTML and carries malicious JS, as in the 'c' param of the URL in my question above, then my code fails to strip it, because the JS is not in any HTML tag like — Ash, Mar 29 '16 at 18:14
Might look into caja instead of HTML Sanitizer. (Sanitizer is targeted specifically for HTML.) https://github.com/google/caja Caja is designed to also handle CSS and user-fed javascript. — avgvstvs, Mar 31 '16 at 11:55
