DISCLAIMER:
If output-escaping is not allowed in your internet-facing solution, you are in a NO-WIN SCENARIO. It's like antivirus on Windows: You'll be able to detect specific and known attacks, but you will be unable to detect or defend against unknown attacks. If your employer insists on this path, your due diligence is to make management aware of this fact and get their acceptance of the risks in writing. Every time I've confronted management with this, they've opted for the correct solution--output escaping.
================================================================
First off... watch out when using JSoup in any kind of a cleaning/filtering/input validation situation.
Upon receiving invalid HTML, like
<script>alert(1);
Jsoup will add in the missing </script>
tag.
This means that if you're using Jsoup to "cleanse" HTML, it first transforms INVALID HTML into VALID HTML, before it begins processing.
So the question is: Is there some way through which I can make sure
that my input is stripped of all HTML and javascript code at the
filter? Should I throw in some regex validations but is there any
regex that will take care of the cases that are getting past the check
I have right now?
No. ESAPI and ESAPI's input validation is not appropriate for your use case because HTML is not a regular language and ESAPI's input for its validation are Regular Expressions. The fact is you cannot do what you ask:
Is there some way through which I can make sure that my input is
stripped of all HTML and javascript code at the filter?
And still have a functioning web application that requires user-defined HTML/JavaScript.
You can stack the deck in your favor a little bit: I would choose something like OWASP's HTML Sanitizer. and test your implementation against the XSS inputs listed here.
Many of those inputs are taken from OWASP's XSS Filter evasion cheat sheet, and will at least exercise your application against known attempts. But you will never be secure without output escaping.
===================UPDATE FROM COMMENTS==================
SO the use case is to try and block all html and javascript. My recommendation is to implement caja since it encapsulates HTML, CSS, and Javascript.
Javascript though is also difficult to manage from input validation, because like HTML, JavaScript is a non-regular language. Additionally, each browser has its own implementation that deviates in different ways from the ECMAScript spec. If you want to protect your input from being interpreted, this means you'd ideally have to have a parser for each browser family attempting to interpret user input in order to block it.
When all you've really got to do is make sure that the output is escaped. Sorry to beat a dead horse, but I have to stress that output escaping is 100x more important than rejecting user input. You want both, but if forced to choose one or the other, output escaping is less work overall.