Input Validation - Regex to match scripts

Question

What are the Regex expressions that I would need to test a string against in order to make sure it doesn't contain any Javascript?

I'm using this article as a starting point.

Code in blocks:

<script.*?>[\s\S]*?</.*?script>
":javascript" code (e.g. <a href="javascript:alert('hello')"):

(?<=<.*)javascript.*:[^"]*
Event handlers within Html tags (e.g. <div onmouseover=""):

on\w+="[^"]*"

You don't want to remove all mentions of javascript. That would be a bad idea. — John Dvorak, Jan 23 '13 at 10:32
I would do it the other way around, positively test for what you *do* want — Vorsprung, Jan 23 '13 at 10:32
I recommend using a DOM parser rather than regexes. [You can't parse HTML with regexes](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). It's _trivial_ with a proper DOM parser at hand. — John Dvorak, Jan 23 '13 at 10:33
Also, you can't protect yourself from __malformed HTML__ (unclosed tags, runt closing tags...) this way. — John Dvorak, Jan 23 '13 at 10:36
-1 and voting to close as too localised. Since you are already using an HTML parser to verify the HTML is valid (do you???) knowing how to use regex to ensure it doesn't contain javascript is _not useful_. — John Dvorak, Jan 23 '13 at 10:41
No I'm not using an HTML parser to verify the HTML is valid. — Ian Warburton, Jan 23 '13 at 10:43
@IanWarburton then you have other issues to face than just cross-site scripting. No regex ever can fix unbalanced HTML. Consider if I sent this rich-text content: ``. I'm pretty sure it would basically kill your design _and_ your page javascript. — John Dvorak, Jan 23 '13 at 10:45
So I should check for the existence of '<' and '>' in a string and then run a DOM parser on the string and use the result of that to look for script elements? — Ian Warburton, Jan 23 '13 at 10:47
Just feed your entire string to the DOM parser, crawl the DOM for any active elements, remove them, and export back to string. This will balance your HTML on the go. As a bonus, it might even strip some unneccessary whitespace. — John Dvorak, Jan 23 '13 at 10:48
Cool. I would prefer not to run that on every string though because my input validation runs for *every* form value sent to a CMS. What would be a simple way to check that a value at least contains some tags? — Ian Warburton, Jan 23 '13 at 10:51
@IanWarburton Even if it doesn't, you should _still_ HTML-escape all contents that needs to be. Again, the DOM parser fixes that for you. — John Dvorak, Jan 23 '13 at 10:53
Also, why is it a bad idea to remove all mentions of Javascript? — Ian Warburton, Jan 23 '13 at 10:53
@IanWarburton then you wouldn't be able to write this question. — John Dvorak, Jan 23 '13 at 10:54
HTML-parse the inputs that can contain HTML and HTML-escape those that cannot. That's all you should do. Don't worry about performance. DOM parsing isn't slow. — John Dvorak, Jan 23 '13 at 10:57
Note that you should have a whitelist of tags, rather than a blacklist. Here are some other tags that you wouldn't like injected: `embed`, `video`, `audio`, `object`, `applet`, `form` and the related tags (most likely), ... and more are sure to come. — John Dvorak, Jan 23 '13 at 11:00
(I only get to say whether a string is valid or not. This is for a custom RequestValidator for ASP.net.) Anyhow, so I parse all values and reject any values that contain DOM elements that aren't on the white list or DOM elements with event handlers! — Ian Warburton, Jan 23 '13 at 11:10
@IanWarburton If your DOM parser is able to die on malformed HTML, you can use it to validate. The ones that I've worked with lack that ability. If this is the case of your parser, then you need to _filter_ your inputs rather than _validate_ them with a DOM parser. — John Dvorak, Jan 23 '13 at 11:13
http://htmlagilitypack.codeplex.com/ - "The parser is very tolerant with "real world" malformed HTML.". That sounds like it prides itself on not dying. So what is the different between filtering and validating? Can't I just compare all the tags in the DOM with the ones in the white list and reject the whole string if a non-match is found? — Ian Warburton, Jan 23 '13 at 11:27
Or use this... http://stackoverflow.com/questions/3107514/html-agility-pack-strip-tags-not-in-whitelist and just compare the length of the string before and after? — Ian Warburton, Jan 23 '13 at 11:31
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/23215/discussion-between-jan-dvorak-and-ian-warburton) — John Dvorak, Jan 23 '13 at 11:38

John Dvorak · Accepted Answer · 2013-01-23T11:38:27.123

Don't ever use regexes to parse HTML. You might be able to ensure it does'n contain javascript, but you can't ensure it won't be horribly broken in other ways. Instead, use a proper parser.

Also, even valid HTML that doesn't contain javascript can still contain other unpleasant elements (audio, video, CSS nodes, form elements...), I recommend using a whitelist for the HTML elements that you do allow.

Here's an example of how your code could look like (note that even though it's supposed to be pseudocode, this might actually be proper C# syntax):

string[] tagWhitelist = ['strong', 'em', 'span' /*, ...*/];
string[] attrWhitelist = [/*...*/];

void function fixNode(DOMNode node, bool dieOnError){
   if(tagWhitelist.contains(node.type()){
      node.children.each((x) => fixNode(x))
      node.attributes
         .filter((x) => !attrWhitelist.contains(x))
         .each((x) => dieOnError ? throw new InvalidTagException() : x.remove())
   }else{
      dieOnError ? throw new InvalidAttrException() : node.remove()
   }
}

...

string output = fixNode(DOMParser.load(input, {strict:false}), false).toString();

This can also be used for validation, but only if the parser is able to throw an exception on invalid HTML (the ones I've worked with always try to fix the code):

try{
   // note: if fixNode is only ever used to validate, don't use exceptions
   fixNode(DOMParser.load(input, {strict:true}), true);
   return true;
}catch(InvalidTagException, InvalidAttrException ex){
   return false;
}

Update: the code you have linked in the comment claims to do exactly this, but I cannot guarantee it actually does.

Input Validation - Regex to match scripts

1 Answers1