6

I am using the Newtonsoft JSON deserializer. How can I sanitize JSON against XSS (cross-site scripting)? Should I clean the JSON string before deserializing it, or write some kind of custom converter/sanitizer? Either way, I am not 100% sure about the best way to approach this.

Below is an example of JSON that has a dangerous script injected and needs "cleaning." I want a way to manage this before I deserialize it. But we need to assume all kinds of XSS scenarios, including BASE64-encoded script etc., so the problem is more complex than a simple regex string replace.

{ "MyVar" : "hello<script>bad script code</script>world" } 

Here is a snapshot of my deserializer ( JSON -> Object ):

public T Deserialize<T>(string json)
{
    // OPTION 1: sanitize the raw JSON string here (cleanJSON is not written yet)
    var cleanedJson = cleanJSON(json);

    // OPTION 2: sanitize values during deserialization with a custom converter (not written yet)
    var customConverter = new JSONSanitizer();

    return JsonConvert.DeserializeObject<T>(cleanedJson, customConverter);
}

JSON is posted from a 3rd party UI interface, so it's fairly exposed, hence the server-side validation. From there, it gets deserialized into all kinds of objects and is usually stored in a DB, later to be retrieved and output directly in an HTML-based UI, so script injection must be mitigated.

Gray
MarzSocks
  • I've updated my question to address what I mean by "cleanup". – MarzSocks Sep 21 '15 at 16:03
  • It depends on the context. Could you provide some details about how the data will be displayed? Will it contain URL data? Is it going to be placed straight into the HTML? Is it going to be accessed from javascript only? Is it an HTML attribute? XSS prevention really depends on the context. – Gray Sep 21 '15 at 17:18
  • JSON is posted from a 3rd party UI interface, so it's fairly exposed, hence the server-side validation. From there it gets deserialized into all kinds of objects and is usually stored in a DB, later to be retrieved and output directly in an HTML-based UI, so script tags must be controlled. Ideally I want to clean it before it even enters the logic layer of the application, and the serializer is the one place to rule them all. :-) – MarzSocks Sep 21 '15 at 18:11
  • Ah, so it actually contains HTML data that is supposed to render? You're going to need to parse that HTML against a whitelist and strip attributes (either all of them or everything outside a whitelist, depending on what you are doing). Even then you won't be done: if you allow certain tags (like `a` tags), you'll need to validate them (for example, no `javascript:` or `data:` schemes, only http(s) or whatever else you expect). – Gray Sep 21 '15 at 18:16
  • Exactly, it's actually a fairly complex task if you want to do it properly. JSON XSS must be a fairly standard problem - I was hoping there might be a standard way of dealing with it. – MarzSocks Sep 21 '15 at 18:25
  • It's not so much that it comes from JSON - that doesn't even really matter. Your problem is really just a problem of sanitizing HTML. If the values did not have to render as HTML, it would be a simple matter of encoding `<`, `>`, `'`, `"`, and `&` (maybe others). Since they do, you'll definitely need an HTML parser to do this. [Don't try to do it with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Gray Sep 21 '15 at 18:28
  • Exactly, so the question is how to encode just the JSON strings. :-) There is another catch here: some XSS is done by converting to BASE64 strings, which can actually be parsed by the browser. So sanitizing can't be done on assumed text only. These guys are smart. – MarzSocks Sep 21 '15 at 18:36

2 Answers

3

OK, I am going to try to keep this rather short, because writing up the whole thing would be a lot of work. Essentially, you need to focus on the context of the data you need to sanitize. From the comments on the original post, it sounds like some values in the JSON will be used as HTML that will be rendered, and this HTML comes from an untrusted source.

The first step is to extract whichever JSON values need to be sanitized as HTML. Each of those values then needs to be run through an HTML parser, stripping away everything that is not in a whitelist. Don't forget that you will also need a whitelist for attributes.
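As a rough sketch of the extraction step, the JSON can be loaded into a `JToken` tree and every string value passed through a sanitizing function. The class name `JsonStringSanitizer` and the `sanitize` delegate are hypothetical, and this version treats every string value as a candidate rather than picking specific properties. Property names are left untouched, and values that Json.NET parses as dates are not visited.

```csharp
using System;
using System.Linq;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JsonStringSanitizer
{
    // Runs `sanitize` over every string value in the JSON and returns the rewritten JSON text.
    public static string SanitizeStrings(string json, Func<string, string> sanitize)
    {
        JToken root = JToken.Parse(json);

        // A bare JSON value at the root has no children to walk.
        var rootValue = root as JValue;
        if (rootValue != null)
        {
            if (rootValue.Type == JTokenType.String)
                rootValue.Value = sanitize((string)rootValue.Value);
            return root.ToString(Formatting.None);
        }

        // Snapshot the string values before changing them.
        var stringValues = ((JContainer)root).Descendants()
            .OfType<JValue>()
            .Where(v => v.Type == JTokenType.String)
            .ToList();

        foreach (JValue value in stringValues)
            value.Value = sanitize((string)value.Value);

        return root.ToString(Formatting.None);
    }
}
```

The delegate is where the real work happens: pass in whichever HTML sanitizer you settle on. For values that should never contain markup, something like `JsonStringSanitizer.SanitizeStrings(json, System.Net.WebUtility.HtmlEncode)` would encode them instead.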

HTML Agility Pack is a good starting place for parsing HTML in C#. How to do this part is a separate question in my opinion - and probably a duplicate of the linked question.
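A minimal sketch of what a whitelist pass with HTML Agility Pack might look like follows. The class name and both whitelists are made up for illustration, and the policy here (drop a disallowed element together with its content, keep text nodes as they are) is only one possible choice. It is meant to show the shape of the approach, not a hardened sanitizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlWhitelistSanitizer
{
    // Illustrative whitelists only; adjust them to what the UI actually needs.
    private static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "b", "i", "em", "strong", "p", "br", "ul", "ol", "li" };

    private static readonly HashSet<string> AllowedAttributes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "title" };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);
        Clean(doc.DocumentNode);
        return doc.DocumentNode.OuterHtml;
    }

    private static void Clean(HtmlNode parent)
    {
        // Copy the child list because nodes are removed while iterating.
        foreach (HtmlNode child in parent.ChildNodes.ToList())
        {
            if (child.NodeType == HtmlNodeType.Text)
                continue; // text nodes are kept as they are

            // Drop comments and any element that is not whitelisted, content included.
            if (child.NodeType != HtmlNodeType.Element || !AllowedTags.Contains(child.Name))
            {
                parent.RemoveChild(child);
                continue;
            }

            // Strip every attribute that is not whitelisted.
            foreach (HtmlAttribute attribute in child.Attributes.ToList())
            {
                if (!AllowedAttributes.Contains(attribute.Name))
                    attribute.Remove();
            }

            Clean(child);
        }
    }
}
```

With this policy, the `<script>` element from the question's example is removed along with the text inside it, while the surrounding `hello` and `world` text is kept.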

Your worry about base64 strings seems a little overemphasized in my opinion. It's not like you can simply put aW5zZXJ0IGg0eCBoZXJl into an HTML document and the browser will render it. It can be abused through javascript (which your whitelist will prevent) and, to some extent, through data: URLs (but this isn't THAT bad, as the javascript will run in the context of the data page. Not good, but you aren't automatically gobbling up cookies with this). If you have to allow `a` tags, part of the process needs to be validating that the URL is http(s) (or whatever schemes you want to allow).
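The URL check can be sketched with `System.Uri`. The helper name `IsAllowedLink` is hypothetical, and this version assumes only absolute http(s) links are wanted, so relative URLs are rejected as well. It also assumes the `href` value has already been entity-decoded by the HTML parser.

```csharp
using System;

public static class UrlSchemeValidator
{
    // Returns true only for absolute URLs whose scheme is http or https.
    public static bool IsAllowedLink(string href)
    {
        if (string.IsNullOrWhiteSpace(href))
            return false;

        Uri uri;
        if (!Uri.TryCreate(href.Trim(), UriKind.Absolute, out uri))
            return false; // relative or malformed URLs are rejected in this sketch

        // Uri.Scheme is normalized to lower case, so a plain comparison is enough.
        return uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps;
    }
}
```

Checking against a list of allowed schemes, rather than trying to block `javascript:` and `data:` specifically, means any scheme you did not anticipate is rejected by default.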

Ideally, you would avoid this uncomfortable situation, and instead use something like markdown - then you could simply escape the HTML string, but this is not always something we can control. You'd still have to do some URL validation though.

Community
Gray
  • I ended up taking this route. Used the HTML Agility Pack, and sanitized string values during conversion to JSON. – MarzSocks Sep 22 '15 at 14:31
  • Not sure if you are saying you sanitize them BEFORE storing them, but if you are, you might want to at least store the original just in case there's a bug and you corrupt some data. If it's third-party and you aren't storing it at all, then that's fine either way. Glad that it was useful. – Gray Sep 22 '15 at 14:32
2

Interesting! Thanks for asking. We normally use html.urlencode with web forms. I have an enterprise Web API running that has validations like this, for which we created a custom regex. Please have a look at this MSDN link.

This is a sample model, named KeyValue (say), created to parse the request:

public class KeyValue
{
    public string Key { get; set; }
}

Step 1: Trying with a custom regex

// Requires: using System.Linq; using System.Text.RegularExpressions; using Newtonsoft.Json.Linq;
var json = @"[{ 'MyVar' : 'hello<script>bad script code</script>world' }]";

JArray readArray = JArray.Parse(json);
IList<KeyValue> blogPost = readArray.Select(p => new KeyValue { Key = (string)p["MyVar"] }).ToList();

// Validate each Key; calling ToString() on the list itself would only test its type name.
//   ^        anchors the match at the start of the string.
//   \p{..}   matches any character in the named Unicode category given by {..}.
//   \p{L}    matches any letter (this already covers \p{Lu} and \p{Ll}).
//   \p{Lu}   matches uppercase letters.
//   \p{Ll}   matches lowercase letters.
//   \p{Zs}   matches space separators.
//   \'       matches an apostrophe.
//   {1,40}   specifies the number of characters: no fewer than 1 and no more than 40.
//   $        anchors the match at the end of the string.
if (blogPost.Any(item => !Regex.IsMatch(item.Key ?? string.Empty, @"^[\p{L}\p{Zs}\p{Lu}\p{Ll}\']{1,40}$")))
    Console.WriteLine("InValid");

Step 2: Using HttpUtility.UrlEncode - this Newtonsoft website link suggests the implementation below.

string json = @"[{ 'MyVar' : 'hello<script>bad script code</script>world' }]";

JArray readArray = JArray.Parse(json);
IList<KeyValue> blogPost = readArray.Select(p => new KeyValue { Key = HttpUtility.UrlEncode((string)p["MyVar"]) }).ToList();
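One thing worth checking against your output context: URL encoding and HTML encoding produce different results, and they are meant for different places. The sketch below uses `System.Net.WebUtility` as a stand-in for `HttpUtility` so that it does not need a reference to System.Web. The class and method names are hypothetical.

```csharp
using System;
using System.Net;

public static class EncodingComparison
{
    // Percent-encodes the value, as needed inside a URL query string.
    public static string ForUrl(string value)
    {
        return WebUtility.UrlEncode(value);
    }

    // Entity-encodes the value, as needed when writing it into HTML text.
    public static string ForHtml(string value)
    {
        return WebUtility.HtmlEncode(value);
    }
}
```

For the example value, `ForUrl` turns `<script>` into `%3Cscript%3E`, while `ForHtml` turns it into `&lt;script&gt;`. If the stored value is later written into an HTML page, the HTML-encoded form is the one that displays as the original text.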
staticvoidmain