50

Is there a library or acceptable method for sanitizing the input to an html page?

In this case I have a form with just a name, phone number, and email address.

Code must be C#.

For example:

"<script src='bobs.js'>John Doe</script>" should become "John Doe"

Julian
NotMe
  • You MUST protect the output (e.g., see [Jeremy Cook's answer below](https://stackoverflow.com/a/19188104/52277)). Input sanitisation is an optional “nice to have” on top of that: it only reduces the risk of XSS attacks and does not fully protect against them. – Michael Freidgeim Feb 03 '23 at 21:57

5 Answers

73

We are using the HtmlSanitizer .NET library, which is also available on NuGet.

Julian
11

Based on the comment you made to this answer, you might find some useful info in this question:
https://stackoverflow.com/questions/72394/what-should-a-developer-know-before-building-a-public-web-site

Here's a parameterized query example. Instead of this:

string sql = "UPDATE UserRecord SET FirstName='" + txtFirstName.Text + "' WHERE UserID=" + UserID;

Do this:

SqlCommand cmd = new SqlCommand("UPDATE UserRecord SET FirstName= @FirstName WHERE UserID= @UserID");
cmd.Parameters.Add("@FirstName", SqlDbType.VarChar, 50).Value = txtFirstName.Text;
cmd.Parameters.Add("@UserID", SqlDbType.Int).Value = UserID;

Edit: Since there was no injection, I removed the portion of the answer dealing with that. I left the basic parameterized query example, since that may still be useful to anyone else reading the question.
--Joel

Community
Joel Coehoorn
  • Actually, no. I was just trying to be proactive with some new development. Great info though. – NotMe Oct 09 '08 at 20:07
  • Make sure you've seen the latest edit: I added a very useful link at the bottom. – Joel Coehoorn Oct 09 '08 at 20:12
  • 1
    BTW, I'm already using s'procs anyway. I just want to make sure that systems downstream (which I have absolutely no control over) don't incorrectly deal with the input. – NotMe Oct 09 '08 at 21:09
9

It sounds like you have users that submit content but you cannot fully trust them, and yet you still want to render the content they provide as super safe HTML. Here are three techniques: HTML encode everything, HTML encode and/or remove just the evil parts, or use a DSL that compiles to HTML you are comfortable with.

  1. Should it become "John Doe"? I would HTML encode that string and let the user, "John Doe" (if indeed that is his real name...), have the stupid looking name <script src='bobs.js'>John Doe</script>. He shouldn't have wrapped his name in script tags or any tags in the first place. This is the approach I use in all cases unless there is a really good business case for one of the other techniques.

  2. Accept HTML from the user and then sanitize it (on output) using a whitelist approach like the sanitization method @Bryant mentioned. Getting this right is (extremely) hard, and I defer pulling that off to greater minds. Note that some sanitizers will HTML encode evil where others would have removed the offending bits completely.

  3. Another approach is to use a DSL that "compiles" to HTML. Make sure to whitehat your DSL compiler because some (like MarkdownSharp) will allow arbitrary HTML like <script> tags and evil attributes through unencoded (which by the way is perfectly reasonable but may not be what you need or expect). If that is the case you will need to use technique #2 and sanitize what your compiler outputs.
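Technique 1 can be sketched as follows. This is a minimal illustration, not code from the answer: `ContactCard` and `Render` are hypothetical names, and the phone number and email are made-up sample values. It assumes the three form fields from the question and encodes each one with `System.Net.WebUtility.HtmlEncode` at the point where it is written into the page.

```csharp
using System;
using System.Net;

// Hypothetical helper (not part of any library): HTML-encodes each form
// field at the moment it is written into the output markup.
public static class ContactCard
{
    public static string Render(string name, string phone, string email)
    {
        // WebUtility.HtmlEncode replaces <, >, & and quote characters with
        // character references, so the browser shows the text as typed
        // instead of parsing it as markup.
        return "<ul>"
             + "<li>" + WebUtility.HtmlEncode(name) + "</li>"
             + "<li>" + WebUtility.HtmlEncode(phone) + "</li>"
             + "<li>" + WebUtility.HtmlEncode(email) + "</li>"
             + "</ul>";
    }
}

public static class Program
{
    public static void Main()
    {
        // The "stupid looking name" from the question is displayed as
        // literal text; no script element reaches the browser.
        string html = ContactCard.Render(
            "<script src='bobs.js'>John Doe</script>",
            "555-0100",
            "john@example.com");
        Console.WriteLine(html);
    }
}
```

Because the encoding happens on output, the stored value stays exactly as the user typed it, and any downstream consumer can apply the encoding appropriate to its own context.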

Closing thoughts:

Jeremy Cook
8

If by sanitize you mean REMOVE the tags entirely, the RegEx example referenced by Bryant is the type of solution you want.

If you just want to ensure that the code DOESN'T mess with your design when it is rendered to the user, you can use the HttpUtility.HtmlEncode method to prevent that.
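The two options can be compared side by side. This is a rough sketch under stated assumptions: `NameCleaner`, `StripTags`, and `Encode` are hypothetical names, and the pattern `<[^>]*>` is a deliberately naive stand-in, since the regex referenced in this thread is not reproduced here. A pattern this simple should not be relied on as an XSS defense by itself. `HttpUtility.HtmlEncode` lives in the `System.Web` namespace.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Web;

// Hypothetical helper contrasting "remove the tags" with "encode the tags".
public static class NameCleaner
{
    // Naive pattern: anything between '<' and the next '>'. It handles the
    // simple example from the question but is not a complete sanitizer.
    private static readonly Regex Tag = new Regex("<[^>]*>");

    // Removes the tags and keeps the text between them.
    public static string StripTags(string input)
    {
        return Tag.Replace(input ?? string.Empty, string.Empty);
    }

    // Keeps everything but makes it inert for display in HTML.
    public static string Encode(string input)
    {
        return HttpUtility.HtmlEncode(input ?? string.Empty);
    }
}

public static class Program
{
    public static void Main()
    {
        string raw = "<script src='bobs.js'>John Doe</script>";

        // Prints: John Doe
        Console.WriteLine(NameCleaner.StripTags(raw));

        // Prints the original text with its markup characters replaced by
        // character references, so it contains no raw '<' or '>'.
        Console.WriteLine(NameCleaner.Encode(raw));
    }
}
```

Stripping changes what the user submitted, while encoding preserves it and only changes how it is displayed, which is why encoding on output is the safer default.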

Mitchel Sellers
  • 1
    Is there a reason to do that instead of the simpler regex by Jakub? – NotMe Oct 09 '08 at 21:10
  • The regex solution will remove the code; it works, but it takes time. HtmlEncode just formats it in a safe manner for web display. – Mitchel Sellers Oct 09 '08 at 21:18
  • 3
    Sanitizing HTML is notoriously tricky to get right. There are so many ways an attacker can get JavaScript code to fire. Consider `Click me`, and that's just the tip of the iceberg. Outputting HTML encoded user input is a surefire approach to render safe HTML. – Jeremy Cook Oct 04 '13 at 16:34
7

What about using the Microsoft Anti-Cross Site Scripting Library?

Community
  • Interesting. When I have time I'll play with it. Looks promising though. – NotMe Nov 10 '09 at 23:48
  • 1
    The link above references v3.1 of the Anti-Cross Site Scripting Library. [Version 4.0 is the most current release](http://www.microsoft.com/download/en/details.aspx?id=5242). – CBono Oct 10 '11 at 13:39
  • the above link is outdated as well, edited the answer to include the correct link to the MSACSS Library – Adam May 23 '12 at 17:42
  • 2
    Obsolete as of today. – Nabeel Oct 13 '20 at 23:53