1

I am having trouble removing all JavaScript from an HTML page with C#. I have three regular expressions that remove a lot but miss a lot too. Parsing the page with the MSHTML DOM parser causes the JavaScript to actually run, which is what I am trying to avoid by using regex.

    "<script.*/>"

    "<script[^>]*>.*</script>"

    "<script.*?>[\\s\\S]*?</.*?script>"

Does anyone know what is causing these three regular expressions to miss blocks of JavaScript?

An example of what I am trying to remove:

    <script src="do_files/page.js" type="text/javascript"></script>
    <script src="do_files/page.js" type="text/javascript" />
    <script type="text/javascript">
    <!--
        var Time=new Application('Time')
    //-->
    </script>
    <script type="text/javascript">
        if(window['com.actions']) {
            window['com.actions'].approvalStatement =  "",
            window['com.actions'].hasApprovalStatement = false
        }
    </script>
tcables
  • Could you give an example of a missed block? – Whetstone Nov 07 '11 at 19:19
  • Use an HTML parser (like [Nokogiri](http://nokogiri.org)) and modify the DOM; [do not use a regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) on the raw HTML. Are you trying to do this on the web browser client or on the server? If the server, what programming language? – Phrogz Nov 07 '11 at 19:20
  • If anything, it looks like your regexes will match *more* than you want. Your #2 is doing a greedy `.*`, so it will match everything from the first `<script>` to the last `</script>`, possibly including content *between* script tags that you didn't mean to remove. – Joe White Nov 07 '11 at 19:29
  • The language is C#. Using the MSHTML parser actually runs the JavaScript, which is what I am trying to avoid by removing it in the first place. – tcables Nov 07 '11 at 19:32
  • Regex is not particularly good for **PARSING** HTML, but that is because HTML allows nested constructs (like `hello world`). Script tags have basically no nesting, so the problem is nowhere near as pertinent (comments or CDATA sections are often used inside script tags, but these are not a challenge to ignore). **REMOVING or STRIPPING** HTML is slightly different, as the expressions can be significantly less complex. – Code Jockey Nov 07 '11 at 21:39

6 Answers

4

I assume you are simply trying to sanitize the input of JavaScript. Frankly, I'm worried that this solution is too simple to be right. See below for the reasoning, after the expression (as a C# string):

@"(?s)<script.*?(/>|</script>)"

That's it - I hope! (It certainly works for your examples!)

My reasoning for the simplicity is that the primary issue with trying to parse HTML with regex is the potential for nested tags. It's not so much the nesting of DIFFERENT tags, but the nesting of tags with the SAME name.

For example,

<b> bold <i> AND italic </i></b>

...is not so bad, but

<span class='BoldText'> bold <span class='ItalicText'> AND italic </span></span>

would be much harder to parse, because the ending tags are IDENTICAL.

However, since it is invalid to nest script tags, the next instance of /> (is a self-closing script tag even valid?) or </script> is the end of this script block.

There's always the possibility of HTML comments or CDATA sections inside the script tag, but those should be fine as long as they don't contain </script>. If they do, it would definitely be possible to get some 'code' through. I don't think the page would render, but some HTML parsers are amazingly flexible, so you never know. To tolerate a little extra whitespace, you could use:

@"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)"

Please let me know if you can figure out a way to break it that will let through VALID HTML code with run-able JavaScript (I know there are a few ways to get some stuff through, but it should be broken in one of many different ways if it does get through, and should not be run-able JavaScript code.)
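For anyone who wants to try breaking it, here is a minimal sketch of how the whitespace-tolerant expression could be applied in C#. The class and method names (ScriptStripper, Strip) are made up for illustration, and RegexOptions.IgnoreCase is an addition that is not part of the expression above; without it, an upper-case <SCRIPT> tag would not match.

```csharp
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // The expression from this answer; (?s) makes '.' match newlines as well.
    // IgnoreCase is an assumption added here so that <SCRIPT> is caught too.
    private static readonly Regex ScriptBlock = new Regex(
        @"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)",
        RegexOptions.IgnoreCase);

    // Returns the input with every match of the expression removed.
    public static string Strip(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

Passing the sample blocks from the question through ScriptStripper.Strip should leave nothing of them behind. One case worth knowing about: a script body that itself contains />, such as document.write('<br />'), ends the match early, so the tail of the script (up to and including </script>) stays in the output as inert text rather than being removed.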

Code Jockey
  • Of course, this should handle complete removal of any valid script blocks, and valid HTML in should be valid HTML out (minus script blocks) – Code Jockey Nov 08 '11 at 15:00
3

It is generally agreed upon that trying to parse HTML with regex is a bad idea and will yield bad results. Instead, you should use a DOM parser. jQuery wraps nicely around the browser's DOM and would allow you to very easily remove all <script> tags.

Alex Turpin
2

I have faced a similar case, where I needed to clean "rich text" (text with HTML formatting) of any possible JavaScript.

There are several ways to add JavaScript to HTML:

  • by using the <script> tag, with JavaScript inside it or by loading a JavaScript file using the "src" attribute. ex: <script>maliciousCode();</script>

  • by using an event on an HTML element, such as "onload" or "onmouseover". ex: <img src="a.jpg" onload="maliciousCode()">

  • by creating a hyperlink that calls JavaScript code. ex: <a href="javascript:maliciousCode()">...

This is all I can think of for now.

So the submitted HTML Code needs to be cleaned from these 3 cases. A simple solution would be to look for these patterns using Regex, and replace them by "" or do whatever else you want.

This is a simple code to do this:

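// Requires: using System.Text.RegularExpressions;
public static string CleanHTMLFromScript(string str)
{
    // Remove whole <script>...</script> elements, then any leftover opening or closing script tags.
    str = Regex.Replace(str, "<script[^>]*>[\\s\\S]*?</script\\s*>", "", RegexOptions.IgnoreCase);
    str = Regex.Replace(str, "</?script[^>]*>", "", RegexOptions.IgnoreCase);
    // Remove any tag that carries an on* event-handler attribute (the whole tag is dropped).
    str = Regex.Replace(str, "<[a-z][^>]*\\son[a-z]+\\s*=\"?[^\"]*\"?[^>]*>", "", RegexOptions.IgnoreCase);
    // Remove <a> tags whose href uses the javascript: scheme.
    str = Regex.Replace(str, "<a\\s+href\\s*=\\s*\"?\\s*javascript:[^\"]*\"[^>]*>", "", RegexOptions.IgnoreCase);
    return str;
}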

This code takes care of any spaces and quotes that may or may not be added. It seems to be working fine, not perfect but it does the trick. Any improvements are welcome.
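One possible improvement, since the function above drops the whole tag when it finds an event handler: a variant that removes only the handler attribute and keeps the element. This is a rough sketch, not a hardened sanitizer; the names HandlerStripper and RemoveHandlers are made up for illustration, and a '>' inside an attribute value will cut the tag match short.

```csharp
using System.Text.RegularExpressions;

public static class HandlerStripper
{
    // Any opening tag. A '>' inside an attribute value ends the match early.
    private static readonly Regex Tag =
        new Regex(@"<[a-z][^>]*>", RegexOptions.IgnoreCase);

    // An on* attribute with a double-quoted, single-quoted, or unquoted value.
    private static readonly Regex Handler = new Regex(
        @"\s+on[a-z]+\s*=\s*(?:""[^""]*""|'[^']*'|[^\s>]+)",
        RegexOptions.IgnoreCase);

    // Strips on* attributes inside tags only; text between tags is left alone.
    public static string RemoveHandlers(string html)
    {
        return Tag.Replace(html, m => Handler.Replace(m.Value, string.Empty));
    }
}
```

With the example from the list above, RemoveHandlers turns <img src="a.jpg" onload="maliciousCode()"> into <img src="a.jpg">, so the image survives and only the handler is gone.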

0

Creating your own HTML parser or script detector is a particularly bad idea if this is being done to prevent cross-site scripting. Doing this by hand is a Very Bad Idea, because there are any number of corner cases and tricks that can be used to defeat such an attempt. This is termed "black listing", as it attempts to remove the unsafe items from HTML, and it's pretty much doomed to failure.

It is much safer to use a white list processor (such as AntiSamy), which lets only approved items through and automatically escapes everything else.

Of course, if this isn't what you're doing then you should probably edit your question to give some more context...

Edit:

Now that we know you're using C#, try the HTML Agility Pack.

Scott A
  • I have had troubles with bugs in the agility pack in the past so I tend to stay away from it...but thanks for the suggestion. – tcables Nov 07 '11 at 19:38
0

Which language are you using? As a general statement, regular expressions are not suitable for parsing HTML.

If you are on the .NET platform, the HTML Agility Pack offers a much better parser.

Michael Stum
0

You should use a real HTML parser for the job. That said, for simple stripping of script blocks you could use a rudimentary regex like the one below.

The idea is that you need a callback to check whether capture group 1 matched. If it did, the match is something that hides HTML (like a comment) and the callback returns it unchanged; otherwise the match is a script block and the callback returns an empty string.

This won't substitute for an html processor though. Good luck!

Search Regex: (modifiers - expanded, global, include newlines in dot, callback func)

  (?:
     <script (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? \s*> .*? </script\s*>
   | </?script (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? \s*/?>
  )
|
  (   # Capture group 1
    <!(?:DOCTYPE.*?|--.*?--)>  # things that hide html, add more constructs here ...
  )

Replacement func pseudo code:

string callback () {
  if capture buffer 1 matched
    return capt buffer 1
  else return ''

}
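
In C#, the callback can be a MatchEvaluator lambda passed to Regex.Replace. Below is a minimal sketch, assuming a simplified pattern in the same spirit rather than the exact expression above; the class name ScriptFilter and the group name keep are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class ScriptFilter
{
    // Expanded mode (IgnorePatternWhitespace), '.' matches newlines (Singleline).
    private static readonly Regex Pattern = new Regex(@"
          (?<keep> <!--.*?--> | <!DOCTYPE.*?> )            # things that hide html: passed through
        | <script\b [^>]*? (?: /> | > .*? </script\s*> )   # script element, paired or self-closed
        | </script\s*>                                     # stray closing tag
        ",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string Strip(string html)
    {
        // The lambda is the callback: return the match if 'keep' matched, else nothing.
        return Pattern.Replace(html, m => m.Groups["keep"].Success ? m.Value : string.Empty);
    }
}
```

Because matching proceeds left to right, a comment that happens to contain a script block is consumed as one comment and returned unchanged, while a script block that contains a comment is consumed as one script block and dropped.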