Stripping HTML from a string regex not working in specific case

Question

I have the following code:

    public static string StripHtml(string htmlString)
    {
        string cleansedString = htmlString;
        if (!string.IsNullOrEmpty(htmlString))
        {
            // <<TestString>script> will result in <script> with this regex alone. So we also
            string regex = @"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>";
            cleansedString = Regex.Replace(htmlString, regex, string.Empty, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
        }
        return cleansedString;
    }

This method is meant to strip HTML from user input in order to prevent HTML injection on an ASP.NET web page (the same fields are also filled through an Excel upload process).

It works perfectly except in this case:

"<<TestString>script>" will result in "<script>"

How can I stop this from happening? I was thinking of running it in a loop that keeps calling StripHtml while any brackets remain, but that seems like a hack. Is there a better way to write this regex to account for this case?

There are also other edge cases like `< script> aa script>` or `<script> aa </script>` where your regex doesn't work. See this answer http://stackoverflow.com/a/1732454/932418 — Eser, Aug 27 '15 at 17:06
Why not use the HttpServerUtility.HtmlEncode method? https://msdn.microsoft.com/en-us/library/w3te6wfz(v=vs.110).aspx — bumpy, Aug 27 '15 at 17:12
Also, in cases like yours where there are nested tags, you could do something recursively stripping tags until none exist in the string... — dub stylee, Aug 27 '15 at 17:22
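
The two suggestions in the comments above can be sketched roughly as follows. This is only an illustration, not the original poster's code: the class and method names (`HtmlSanitizerSketch`, `StripHtmlRepeatedly`, `EncodeForHtml`) are hypothetical, the pattern is the one from the question, and `WebUtility.HtmlEncode` from `System.Net` is swapped in for `HttpServerUtility.HtmlEncode` so the sketch does not depend on an ASP.NET server context.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlSanitizerSketch
{
    // The pattern from the question, unchanged.
    private static readonly Regex TagRegex = new Regex(
        @"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Re-applies the pattern until a pass removes nothing.
    // Every successful pass shortens the string, so the loop terminates.
    public static string StripHtmlRepeatedly(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        string previous;
        string current = input;
        do
        {
            previous = current;
            current = TagRegex.Replace(previous, string.Empty);
        } while (current != previous);

        return current;
    }

    // Alternative: leave the text intact and encode it on output,
    // so any markup is displayed as text instead of being interpreted.
    public static string EncodeForHtml(string input)
    {
        return WebUtility.HtmlEncode(input);
    }
}
```

With the input from the question, `StripHtmlRepeatedly("<<TestString>script>")` removes `<TestString>` on the first pass and `<script>` on the second, leaving an empty string, while `EncodeForHtml("<script>")` returns `&lt;script&gt;`. Encoding at output time avoids having to decide what counts as a tag at all, which is why it is often preferred over stripping.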

score 0 · Accepted Answer · answered Aug 27 '15 at 17:17

Disclaimer: you shouldn't use regex to parse HTML.

If you must, these patterns match all the tags; just replace each match with an empty string.

Script(content) and all tags:

@"<(?:script(?:\s+(?:""[\S\s]*?""|'[\S\s]*?'|[^>]*?)+)?\s*>[\S\s]*?</script\s*|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:""[\S\s]*?"")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>"

All tags only:

@"<(?:(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:""[\S\s]*?"")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>"
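
A minimal sketch of how the "all tags only" pattern might be plugged into `Regex.Replace`. The class and method names (`AnswerPatternSketch`, `StripTagsOnce`) are hypothetical, and no `RegexOptions` are passed because the answer does not specify any.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnswerPatternSketch
{
    // The "all tags only" pattern from the answer, copied verbatim.
    private const string AllTagsPattern =
        @"<(?:(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:""[\S\s]*?"")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>";

    // Single pass: replaces every match with an empty string.
    public static string StripTagsOnce(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        return Regex.Replace(input, AllTagsPattern, string.Empty);
    }
}
```

For ordinary markup, `StripTagsOnce("<b>bold</b> text")` gives `bold text`. Note that on the question's input a single pass still leaves `<script>` behind, because the first `<` is not followed by a tag name and only `<TestString>` matches. A nested case like that would still need the replacement repeated until the string stops changing, or output encoding instead of stripping.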
