34

Does anyone have a simple, efficient way of checking that a string doesn't contain HTML? Basically, I want to check that certain fields only contain plain text. I thought about looking for the < character, but that can easily be used in plain text. Another way might be to create a new System.Xml.Linq.XElement using:

XElement.Parse("<wrapper>" + MyString + "</wrapper>")

and check that the XElement contains no child elements, but this seems a little heavyweight for what I need.

Ben Mills
  • To do this, you're probably going to have to define what you mean by "HTML" and "plain text", for example: Will you allow someone to put "" in the plain text, which looks *like* a HTML element but isn't, and also, what characters will you allow.. – Rob Oct 15 '08 at 13:15
  • In my case, I'm fine saying no tags at all, so wouldn't be allowed. My users are a limited number of employees that enter products into our company website. They have started to abuse the fields a little and include HTML in fields that weren't designed to contain HTML. – Ben Mills Oct 15 '08 at 17:29

7 Answers

67

The following will match any paired set of opening and closing tags, e.g. <b>this</b>:

Regex tagRegex = new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");

The following will match any single tag, e.g. <b> (it doesn't have to be closed):

Regex tagRegex = new Regex(@"<[^>]+>");

You can then use it like so:

bool hasTags = tagRegex.IsMatch(myString);
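
As a rough sanity check, the sketch below runs both patterns over a few sample strings. The class name `TagRegexDemo`, the field names and the sample inputs are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagRegexDemo
{
    // First pattern from above: an opening tag followed later by its matching closing tag.
    public static readonly Regex PairedTag =
        new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");

    // Second pattern from above: any run that starts with '<' and ends with '>'.
    public static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    public static void Main()
    {
        string[] samples = { "<b>this</b>", "<br>", "1 < 2", "a < b, b > c" };
        foreach (string s in samples)
        {
            // Print each sample with the result of both checks.
            Console.WriteLine("{0,-14} paired={1,-5} any={2}",
                s, PairedTag.IsMatch(s), AnyTag.IsMatch(s));
        }
    }
}
```

With these inputs, only `<b>this</b>` matches the paired pattern. The looser pattern also matches `<br>` and the plain-text string `a < b, b > c`, so it trades false negatives for false positives.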
ICR
  • The second one matches "a < b, b > c" – Jeroen K Oct 15 '18 at 09:11
  • I like the first one, but it does not contain cases where there are duplicating matches. For example, a string like "texttexttexttext", it will only match "texttext" and "text", ignoring the overlapping "texttexttext" – Skepti_Capy Oct 27 '21 at 14:48
24

You could ensure plain text by encoding the input using HttpUtility.HtmlEncode.

In fact, depending on how strict you want the check to be, you could use it to determine if the string contains HTML:

bool containsHTML = (myString != HttpUtility.HtmlEncode(myString));
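
To see how strict this comparison is, here is a minimal sketch. It assumes `System.Net.WebUtility.HtmlEncode` as a stand-in for `HttpUtility.HtmlEncode` so that it runs without a reference to System.Web; the two are not guaranteed to encode identically in every case. The class name, method name and sample strings are made up for illustration.

```csharp
using System;
using System.Net;

public static class EncodeCheckDemo
{
    // True when encoding changes the string, i.e. it contains at least one
    // character that HtmlEncode escapes (such as <, >, & or ").
    public static bool ChangesWhenEncoded(string s)
    {
        return s != WebUtility.HtmlEncode(s);
    }

    public static void Main()
    {
        string[] samples = { "plain text", "<b>bold</b>", "Tom & Jerry" };
        foreach (string s in samples)
        {
            Console.WriteLine("{0} -> {1}", s, ChangesWhenEncoded(s));
        }
    }
}
```

`Tom & Jerry` is reported as changed even though it contains no markup, so this test is better read as "contains characters that need encoding" than as "contains HTML".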
J c
15

Here you go:

using System.Text.RegularExpressions;
private bool ContainsHTML(string checkString)
{
  return Regex.IsMatch(checkString, "<(.|\n)*?>");
}

That is the simplest way, since text enclosed in angle brackets is unlikely to occur naturally.
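
One possible tightening, sketched here as an assumption rather than as part of the original answer, is to require a letter (optionally preceded by `/`) immediately after the `<`. That leaves comparisons such as `x < 0` alone while still catching ordinary tags. The names `TagLikeDemo` and `LooksLikeTag` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagLikeDemo
{
    // '<', an optional '/', a letter, then anything up to the next '>'.
    private static readonly Regex TagLike = new Regex(@"</?[A-Za-z][^>]*>");

    public static bool LooksLikeTag(string s)
    {
        return TagLike.IsMatch(s);
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikeTag("if x < 0 or y > 10")); // no letter right after '<'
        Console.WriteLine(LooksLikeTag("<b>bold</b>"));        // ordinary tag pair
        Console.WriteLine(LooksLikeTag("a<b and c>d"));        // plain text that is still flagged
    }
}
```

This is still a heuristic: unspaced comparisons such as `a<b and c>d` are flagged, and constructs that do not start with a letter, such as `<!-- comments -->`, are not.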

Leniel Maccaferri
Josef
  • brackets are unlikely to occur naturally?! I don't follow. if somebody types "if x < 0 or y > 10" this regex will capture "< 0 or y >" Yet, there is no HTML in my example. RegEx as an HTML parser has generally been frowned upon. – ripvlan Feb 01 '21 at 20:04
8

I just tried my XElement.Parse solution. I created an extension method on the string class so I can reuse the code easily:

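// Requires: using System.Linq; using System.Xml; using System.Xml.Linq;
public static bool ContainsXHTML(this string input)
{
    try
    {
        XElement x = XElement.Parse("<wrapper>" + input + "</wrapper>");
        // Anything other than plain text nodes (elements, comments, CDATA) counts as markup.
        // An empty or whitespace-only input yields no nodes and is treated as plain text.
        return x.DescendantNodes().Any(n => n.NodeType != XmlNodeType.Text);
    }
    catch (XmlException)
    {
        // Input that is not well-formed XML is treated as containing markup.
        return true;
    }
}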

One problem I found was that plain-text ampersand and less-than characters cause an XmlException, so the field is wrongly reported as containing HTML. To fix this, the input string first needs to have its ampersands and less-than characters converted to their equivalent XHTML entities. I wrote another extension method to do that:

// Requires: using System.Text.RegularExpressions;
public static string ConvertXHTMLEntities(this string input)
{
    // Convert every ampersand that does not already start an &amp; entity.
    // The lookahead avoids a placeholder token that could collide with real input.
    string output = Regex.Replace(input, "&(?!amp;)", "&amp;");

    // Convert less than to the less than entity (without messing up tags).
    // Only a '<' followed by a space is treated as plain text.
    output = output.Replace("< ", "&lt; ");
    return output;
}

Now I can take a user submitted string and check that it doesn't contain HTML using the following code:

bool ContainsHTML = UserEnteredString.ConvertXHTMLEntities().ContainsXHTML();

I'm not sure if this is bulletproof, but I think it's good enough for my situation.

Ash
Ben Mills
  • You're checking to make sure that it doesn't contain XHTML. You're not checking to make sure that it doesn't contain HTML, which doesn't have to be well-formed XML. Also, your code will not catch "this is XHTML". – Robert Rossney Oct 15 '08 at 18:44
  • Actually, old style HTML that is not well formed XML will cause the XElement.Parse method to fail. My method assumes that the Parse method failing means that the string contains some form of HTML. I guess my code really looks for any form of tags. – Ben Mills Oct 16 '08 at 20:48
  • We may also use a regex pattern to check for opening and closing tags. – bijayk Jan 22 '13 at 07:12
7

This also checks for self-closing tags with optional whitespace, such as < br />. The list does not contain the newer HTML5 tags.

internal static class HtmlExts
{
    public static bool containsHtmlTag(this string text, string tag)
    {
        var pattern = @"<\s*" + tag + @"\s*\/?>";
        return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
    }

    public static bool containsHtmlTags(this string text, string tags)
    {
        // True as soon as any tag name in the '|'-separated list is found.
        return tags.Split('|').Any(tag => text.containsHtmlTag(tag));
    }

    public static bool containsHtmlTags(this string text)
    {
        return
            text.containsHtmlTags(
                "a|abbr|acronym|address|area|b|base|bdo|big|blockquote|body|br|button|caption|cite|code|col|colgroup|dd|del|dfn|div|dl|DOCTYPE|dt|em|fieldset|form|h1|h2|h3|h4|h5|h6|head|html|hr|i|img|input|ins|kbd|label|legend|li|link|map|meta|noscript|object|ol|optgroup|option|p|param|pre|q|samp|script|select|small|span|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|ul|var");
    }
}
kns98
2

Angle brackets may not be your only challenge. Other character sequences can also be used for script injection, such as the common double hyphen "--", which can also be used in SQL injection. And there are others.

On an ASP.Net page, if validateRequest = true in machine.config, web.config or the page directive, the user will get an error page stating "A potentially dangerous Request.Form value was detected from the client" if an HTML tag or various other potential script-injection attacks are detected. You probably want to avoid this and provide a more elegant, less-scary UI experience.

You could test for both the opening and closing angle brackets <> using a regular expression, and allow the text if only one of them occurs. Allow < or >, but not < followed by some text and then >, in that order.

You could allow angle brackets and HtmlEncode the text to preserve them when the data is persisted.
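
A minimal sketch of that encode-on-save idea follows. It assumes `System.Net.WebUtility` rather than `HttpUtility` so it runs without System.Web, and the names `EncodeOnSaveDemo`, `ForStorage` and `FromStorage` are hypothetical.

```csharp
using System;
using System.Net;

public static class EncodeOnSaveDemo
{
    // Encode user text before it is persisted, so angle brackets and
    // ampersands survive as entities rather than as markup.
    public static string ForStorage(string userText)
    {
        return WebUtility.HtmlEncode(userText);
    }

    // Decode stored text when the original characters are needed again.
    public static string FromStorage(string stored)
    {
        return WebUtility.HtmlDecode(stored);
    }

    public static void Main()
    {
        string input = "a < b & c > d";
        string stored = ForStorage(input);
        Console.WriteLine(stored);                       // a &lt; b &amp; c &gt; d
        Console.WriteLine(FromStorage(stored) == input); // True
    }
}
```

Whether to encode when saving or when rendering is a design choice; encoding in both places would double-encode the text.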

DOK
  • If your strategy for dealing with SQL injection is stripping "--" out of input, you have a bigger problem. – Robert Rossney Oct 15 '08 at 18:40
  • Excellent point, Robert, but I didn't think this was the place to launch into a full explanation of defense against SQL injection, or other script injection techniques. My first line of defense against SQL injection is using parameterized SQL. What's yours? – DOK Oct 17 '08 at 14:48
0

Beware when using the HttpUtility.HtmlEncode method mentioned above. If the text you are checking contains special characters but no HTML, the comparison will still report HTML, because characters such as & are encoded too. Maybe that's why J c wrote "...depending on how strict you want the check to be..."

Mark