How to validate that a string doesn't contain HTML using C#

Question

Does anyone have a simple, efficient way of checking that a string doesn't contain HTML? Basically, I want to check that certain fields only contain plain text. I thought about looking for the < character, but that can easily be used in plain text. Another way might be to create a new System.Xml.Linq.XElement using:

XElement.Parse("<wrapper>" + MyString + "</wrapper>")

and check that the XElement contains no child elements, but this seems a little heavyweight for what I need.

To do this, you're probably going to have to define what you mean by "HTML" and "plain text". For example: will you allow someone to put "" in the plain text, which looks *like* an HTML element but isn't? And which characters will you allow? — Rob, Oct 15 '08 at 13:15
In my case, I'm fine saying no tags at all, so wouldn't be allowed. My users are a limited number of employees that enter products into our company website. They have started to abuse the fields a little and include HTML in fields that weren't designed to contain HTML. — Ben Mills, Oct 15 '08 at 17:29

score 67 · Answer 1 · answered Oct 15 '08 at 13:18

The following will match any paired set of opening and closing tags, e.g. <b>this</b>:

Regex tagRegex = new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");

The following will match any single tag, e.g. <b> (it doesn't have to be closed):

Regex tagRegex = new Regex(@"<[^>]+>");

You can then use it like so

bool hasTags = tagRegex.IsMatch(myString);
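
To compare how the two patterns behave on ordinary text, here is a minimal sketch. The class name `TagCheck` and the third pattern `StrictTag` are assumptions added for illustration, not part of the answer; `StrictTag` only requires that `<` (or `</`) be followed directly by a letter, which reduces, but does not eliminate, false positives.

```csharp
using System.Text.RegularExpressions;

public static class TagCheck
{
    // First pattern from the answer: an opening tag and its matching closing tag.
    public static readonly Regex PairedTag =
        new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");

    // Second pattern from the answer: anything between '<' and '>'.
    public static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Hypothetical stricter variant: '<' or '</' followed immediately by a letter.
    public static readonly Regex StrictTag = new Regex(@"</?[A-Za-z][^>]*>");
}
```

With these definitions, `AnyTag.IsMatch("a < b, b > c")` is true even though the text holds no markup, while `PairedTag` and `StrictTag` both reject it. `StrictTag` would still flag text such as "a<b and c>d", so none of these is a real HTML parser.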

ICR

The second one matches "a < b, b > c" – Jeroen K Oct 15 '18 at 09:11
I like the first one, but it does not cover cases where matches overlap. For example, in a string like "texttexttexttext", it will only match "texttext" and "text", ignoring the overlapping "texttexttext" – Skepti_Capy Oct 27 '21 at 14:48

score 24 · Answer 2 · edited Dec 06 '13 at 11:20

You could ensure plain text by encoding the input using HttpUtility.HtmlEncode.

In fact, depending on how strict you want the check to be, you could use it to determine if the string contains HTML:

bool containsHTML = (myString != HttpUtility.HtmlEncode(myString));
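
The comparison above flags any string that encoding would change, not only strings with tags. A minimal sketch of that behaviour follows; it uses `WebUtility.HtmlEncode` from `System.Net` instead of `HttpUtility.HtmlEncode` (a substitution made here so the snippet needs no reference to System.Web), and the class and method names are hypothetical.

```csharp
using System.Net;

public static class EncodeCheck
{
    // True when HTML-encoding would alter the string.
    // Assumption: WebUtility.HtmlEncode stands in for HttpUtility.HtmlEncode.
    public static bool ChangesWhenEncoded(string s)
    {
        return s != WebUtility.HtmlEncode(s);
    }
}
```

`ChangesWhenEncoded("plain text")` is false, but "Tom & Jerry", "it's" and "abcd<" all come back true, because ampersands, apostrophes and a lone `<` are encoded as well. The check is therefore better read as "contains characters that are special in HTML" than as "contains HTML".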

answered Oct 15 '08 at 13:36

J c

A simple but effective answer! – Eric Fan Aug 13 '13 at 01:01
Unfortunately doesn't work if your string contains apostrophes, ampersands etc – PeteG Oct 03 '14 at 12:37
@PeteG Good point, yes, it appears that starting in .NET 4 this method actually encodes more things than it used to, such as single quotes. This makes this technique less useful. – J c Oct 04 '14 at 22:19
This says the text "abcd<" contains html – Sreejith K. Nov 14 '19 at 15:10
If you add meaningless extra encoded chars to your string, the method will return true. It's too risky for security checks. – Orhano95 Dec 23 '22 at 11:06

score 15 · Answer 3 · edited May 18 '20 at 16:09

Here you go:

using System.Text.RegularExpressions;

// Returns true if the string contains '<', then any characters (including newlines), then '>'.
private bool ContainsHTML(string checkString)
{
  return Regex.IsMatch(checkString, "<(.|\n)*?>");
}

That is the simplest way, since items in brackets are unlikely to occur naturally.

edited May 18 '20 at 16:09

Leniel Maccaferri

answered Oct 15 '08 at 13:19

Josef

brackets are unlikely to occur naturally?! I don't follow. if somebody types "if x < 0 or y > 10" this regex will capture "< 0 or y >" Yet, there is no HTML in my example. RegEx as an HTML parser has generally been frowned upon. – ripvlan Feb 01 '21 at 20:04

score 8 · Accepted Answer · edited Sep 30 '15 at 14:18

I just tried my XElement.Parse solution. I created an extension method on the string class so I can reuse the code easily:

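// Requires: using System.Linq; using System.Xml; using System.Xml.Linq;
public static bool ContainsXHTML(this string input)
{
    try
    {
        XElement x = XElement.Parse("<wrapper>" + input + "</wrapper>");
        // Plain text parses to text nodes only; empty input parses to no nodes at all.
        return !x.Nodes().All(n => n.NodeType == XmlNodeType.Text);
    }
    catch (XmlException)
    {
        // Malformed markup, or a stray & or <, ends up here and is treated as HTML.
        return true;
    }
}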

One problem I found was that plain text ampersand and less than characters cause an XmlException and indicate that the field contains HTML (which is wrong). To fix this, the input string passed in first needs to have the ampersands and less than characters converted to their equivalent XHTML entities. I wrote another extension method to do that:

public static string ConvertXHTMLEntities(this string input)
{
    // Convert all ampersands to the ampersand entity.
    string output = input;
    output = output.Replace("&amp;", "amp_token");
    output = output.Replace("&", "&amp;");
    output = output.Replace("amp_token", "&amp;");

    // Convert less than to the less than entity (without messing up tags).
    output = output.Replace("< ", "&lt; ");
    return output;
}

Now I can take a user submitted string and check that it doesn't contain HTML using the following code:

bool ContainsHTML = UserEnteredString.ConvertXHTMLEntities().ContainsXHTML();
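
To see what the combined check accepts and rejects, here is a condensed, self-contained sketch of the same idea. The names `XhtmlCheckDemo`, `EscapeLooseEntities` and `LooksLikeMarkup` are made up for this illustration, and the node test is a slight variant of the one above, so treat it as an approximation rather than the exact code from the answer.

```csharp
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlCheckDemo
{
    // Escape bare ampersands and "< " so they do not break the XML parser.
    public static string EscapeLooseEntities(string input)
    {
        string output = input.Replace("&amp;", "amp_token")
                             .Replace("&", "&amp;")
                             .Replace("amp_token", "&amp;");
        return output.Replace("< ", "&lt; ");
    }

    // True if the wrapped input contains anything other than text,
    // or cannot be parsed as XML at all.
    public static bool LooksLikeMarkup(string input)
    {
        try
        {
            XElement x = XElement.Parse(
                "<wrapper>" + EscapeLooseEntities(input) + "</wrapper>");
            return !x.Nodes().All(n => n.NodeType == XmlNodeType.Text);
        }
        catch (XmlException)
        {
            return true;
        }
    }
}
```

Under these assumptions, "Tom & Jerry" and "x < 5" pass as plain text and "<b>bold</b>" is flagged. Note that "a<b" is also flagged: the `<` is not followed by a space, so it is not escaped, and the parser throws.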

I'm not sure if this is bulletproof, but I think it's good enough for my situation.

You're checking to make sure that it doesn't contain XHTML. You're not checking to make sure that it doesn't contain HTML, which doesn't have to be well-formed XML. Also, your code will not catch "this is XHTML". — Robert Rossney, Oct 15 '08 at 18:44
Actually, old style HTML that is not well formed XML will cause the XElement.Parse method to fail. My method assumes that the Parse method failing means that the string contains some form of HTML. I guess my code really looks for any form of tags. — Ben Mills, Oct 16 '08 at 20:48
We could also use a regex pattern to check for opening and closing tags. — bijayk, Jan 22 '13 at 07:12

score 7 · Answer 5 · answered Dec 12 '14 at 17:25

This also checks for self-closing tags with optional whitespace, such as < br />. The list does not include the newer HTML5 tags.

internal static class HtmlExts
{
    public static bool containsHtmlTag(this string text, string tag)
    {
        var pattern = @"<\s*" + tag + @"\s*\/?>";
        return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
    }

    public static bool containsHtmlTags(this string text, string tags)
    {
        // True if any tag in the pipe-separated list occurs in the text.
        return tags.Split('|').Any(tag => text.containsHtmlTag(tag));
    }


    public static bool containsHtmlTags(this string text)
    {
        return
            text.containsHtmlTags(
                "a|abbr|acronym|address|area|b|base|bdo|big|blockquote|body|br|button|caption|cite|code|col|colgroup|dd|del|dfn|div|dl|DOCTYPE|dt|em|fieldset|form|h1|h2|h3|h4|h5|h6|head|html|hr|i|img|input|ins|kbd|label|legend|li|link|map|meta|noscript|object|ol|optgroup|option|p|param|pre|q|samp|script|select|small|span|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|ul|var");
    }
}
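
One limitation of the pattern above is that it only allows whitespace and an optional slash after the tag name, so a tag with attributes such as `<a href="x">` is not detected. A hedged sketch of a variant that tolerates attributes and closing tags follows; the class name, the shortened tag list and the pattern itself are assumptions made for this example.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class TagListCheck
{
    // Hypothetical short list, kept small for the demo.
    private static readonly string[] KnownTags =
        { "a", "b", "br", "div", "p", "script", "span" };

    // Matches <tag>, </tag>, < tag /> and <tag attr="...">, case-insensitively.
    public static bool ContainsKnownTag(string text)
    {
        return KnownTags.Any(tag => Regex.IsMatch(
            text,
            @"<\s*/?\s*" + Regex.Escape(tag) + @"\b[^>]*>",
            RegexOptions.IgnoreCase));
    }
}
```

This catches `<a href="x">link</a>`, `< br />` and `</div>`, and leaves "x < 5 and y > 3" alone. It can still misfire when a tag name happens to follow a less-than sign, as in "a < b and c > d", so the trade-off is more detections at the cost of some extra false positives.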

score 2 · Answer 6 · answered Oct 15 '08 at 13:32

Angle brackets may not be your only challenge. Other character sequences can also be used for script injection, such as the common double hyphen "--", which can also be used in SQL injection. And there are others.

On an ASP.Net page, if validateRequest = true in machine.config, web.config or the page directive, the user will get an error page stating "A potentially dangerous Request.Form value was detected from the client" if an HTML tag or various other potential script-injection attacks are detected. You probably want to avoid this and provide a more elegant, less-scary UI experience.

You could test for both the opening and closing angle brackets <> using a regular expression, and allow the text if only one of them occurs. Allow < or >, but not < followed by some text and then >, in that order.

You could allow angle brackets and HtmlEncode the text to preserve them when the data is persisted.
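
As a rough illustration of the encode-instead-of-reject option, the sketch below encodes user text before it is written into a page. It uses `WebUtility.HtmlEncode` from `System.Net`; the wrapper class and method name are hypothetical.

```csharp
using System.Net;

public static class SafeOutput
{
    // Encode user-entered text so angle brackets are shown literally
    // instead of being interpreted as markup by the browser.
    public static string ForHtml(string userText)
    {
        return WebUtility.HtmlEncode(userText);
    }
}
```

For example, `SafeOutput.ForHtml("if x < 0 or y > 10")` yields `if x &lt; 0 or y &gt; 10`, and a script tag comes out as inert text. With this approach the field can accept any characters, and the question of whether the input "contains HTML" matters less.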

If your strategy for dealing with SQL injection is stripping "--" out of input, you have a bigger problem. — Robert Rossney, Oct 15 '08 at 18:40
Excellent point, Robert, but I didn't think this was the place to launch into a full explanation of defense against SQL injection, or other script injection techniques. My first line of defense against SQL injection is using parameterized SQL. What's yours? — DOK, Oct 17 '08 at 14:48

score 0 · Answer 7 · answered Mar 12 '11 at 06:33

Beware when using the HttpUtility.HtmlEncode method mentioned above. If you are checking some text with special characters, but not HTML, it will evaluate incorrectly. Maybe that's why J c used "...depending on how strict you want the check to be..."
