18

I would like to store JSON content in an HTML document's source, inside a script tag.

The content of that JSON depends on user-submitted input, so great care is needed to sanitise that string against XSS.

I've read about two approaches here on SO.

1. Replace all occurrences of </script with <\/script, or replace all </ with <\/, on the server side.

Code wise it looks like the following (using Python and jinja2 for the example):

# view
import json

data = {
    'test': 'asdas</script><b>as\'da</b><b>as"da</b>',
}

context_dict = {
    'data_json': json.dumps(data, ensure_ascii=False).replace('</script', r'<\/script'),
}

// template
<script>
    var data_json = {{ data_json | safe }};
</script>

// js
access it simply as window.data_json object

2. Encode the data as an HTML-entity-encoded JSON string, then unescape and parse it on the client side. The unescape function is from this answer: https://stackoverflow.com/a/34064434/518169

# view
context_dict = {
    'data_json': json.dumps(data, ensure_ascii=False),
}

// template
<script>
    var data_json = '{{ data_json }}'; // encoded into HTML entities, like &lt; &gt; &amp;
</script>

// js
function htmlDecode(input) {
  var doc = new DOMParser().parseFromString(input, "text/html");
  return doc.documentElement.textContent;
}

var decoded = htmlDecode(window.data_json);
var data_json = JSON.parse(decoded);

This method doesn't work because \" in a script source becomes " in a JS variable. It also creates a much bigger HTML document and is not really human-readable, so I'd go with the first one if it doesn't pose a huge security risk.

Is there any security risk in using the first version? Is it enough to sanitise a JSON encoded string with .replace('</script', r'<\/script')?

Reference on SO:
Best way to store JSON in an HTML attribute?
Why split the <script> tag when writing it with document.write()?
Script tag in JavaScript string
Sanitize <script> element contents
Escape </ in script tag contents

Some great external resources about this issue:
Flask's tojson filter's implementation source
Rails' json_escape method's help and source
A five-year-long discussion in a Django ticket, with proposed code

hyperknot
  • 4
    You should encode `<`, `>`, and `&` as HTML entities. – Pointy Aug 28 '16 at 16:40
  • 6
    I've spent an hour writing this question, including reference to all previous SO answers I found. Receiving a one liner and a close / -1 does not feel helpful at all. – hyperknot Aug 28 '16 at 16:53
  • Well the simple fact is that that's all you need to do: whenever user-supplied content is going to be included as part of the page markup, encode those characters. – Pointy Aug 28 '16 at 16:56
  • If you're including user-supplied content as part of a *script* body, then that doesn't work of course. In that case, encoding `/` as `\/` in string constants is all you need (and that's generally done by any JSON encoder, as it's required by the JSON spec). – Pointy Aug 28 '16 at 16:58
  • 2
    At least `JSON.stringify()` and Python's `json.dumps()` don't escape `/` into `\/`. I'm looking for an automated way, which uses either the script tag parser to decode JSON or `JSON.parse()` on a string. Escaping manually on the server side would need something manual on the client side as well. – hyperknot Aug 28 '16 at 17:07
  • @zsero: As I see you did the right research. In one of the links they mentioned that this is basically a bug in the html specification, which is sad. It seems the only safe way is not to generate json into html but load it from a separate endpoint. If you really have to, you could write your own `dumps` function, which escapes slashes too. (`JSON.parse` will still decode it correctly) – Tamas Hegedus Aug 28 '16 at 17:52
  • 3
    Since the last comment, I've found the `|tojson` filter's implementation in Flask to be the best resource. The source code as well as some really important comments are written there. https://github.com/pallets/flask/blob/78a71a48dcb71cb930d747d9facef0dfa5a8f022/flask/json.py#L158 My understanding of the correct approach is the following then: 1. Use method 1. from my question. 2. encode <, >, & and ' into u00 form (not HTML entities!). 3. Double check if the JSON encoder escapes `\\` or not, as it depends from implementation to implementation (or even changed mid-version sometimes). – hyperknot Aug 28 '16 at 18:06
  • 1
    @zsero backslashes must always be escaped in string literals. Where did you find an implementation which didn't? – Tamas Hegedus Aug 28 '16 at 18:17
  • @TamasHegedus it's not *backslashes* that are the problem; it's *slashes* (`/`). You don't have to quote those in string constants in JavaScript, but for JSON it's a good idea for the very reason explored in this question. – Pointy Aug 28 '16 at 20:25
  • 2
    See http://archive.oreilly.com/pub/a/actionscript/excerpts/as3-cookbook/appendix.html for a list of "u00" substitutes for <, >, &, quotes and slashes. – catamphetamine Dec 15 '16 at 12:59
  • Are you using any particular JS framework in this project? Something like Angular, or React, might come in handy. – Adriano Oct 25 '17 at 03:20

2 Answers

6

Here's how I dealt with the relatively minor part of this issue, the encoding problem with storing JSON in a script element. The short answer is you have to escape either < or / as together they terminate the script element -- even inside a JSON string literal. You can't HTML-encode entities for a script element. You could JavaScript-backslash-escape the slash. I preferred to JavaScript-hex-escape the less-than angle-bracket as \u003C.

.replace('<', r'\u003C')

I ran into this problem trying to pass the JSON from oEmbed results. Some of them contain script close tags (without mentioning Twitter by name).

json_for_script = json.dumps(data).replace('<', r'\u003C')

This turns data = {'test': 'foo </script> bar'} into

'{"test": "foo \\u003C/script> bar"}'

which is valid JSON that won't terminate a script element.
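
A quick way to check that claim, as a minimal sketch using only Python's standard `json` module (the helper name `json_for_script` is made up for this illustration):

```python
import json


def json_for_script(data):
    """Serialize to JSON, then hide every '<' from the HTML parser."""
    # r'\u003C' is the six characters \ u 0 0 3 C, a valid JSON string escape.
    return json.dumps(data).replace('<', r'\u003C')


data = {'test': 'foo </script> bar'}
escaped = json_for_script(data)

# No literal '<' is left, so the HTML parser cannot see an end tag.
assert '<' not in escaped
# A JSON parser turns the escape back into '<', so the data survives intact.
assert json.loads(escaped) == data
```

In JSON output a `<` can only occur inside a string (a key or a value), so a blanket replace on the serialized text should not damage the structure.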

I got the idea from this little gem inside the Jinja template engine. It's what's run when you use the {{data|tojson}} filter.

def htmlsafe_json_dumps(obj, dumper=None, **kwargs):
    """Works exactly like :func:`dumps` but is safe for use in ``<script>``
    tags.  It accepts the same arguments and returns a JSON string.  Note that
    this is available in templates through the ``|tojson`` filter which will
    also mark the result as safe.  Due to how this function escapes certain
    characters this is safe even if used outside of ``<script>`` tags.
    The following characters are escaped in strings:
    -   ``<``
    -   ``>``
    -   ``&``
    -   ``'``
    This makes it safe to embed such strings in any place in HTML with the
    notable exception of double quoted attributes.  In that case single
    quote your attributes or HTML escape it in addition.
    """
    if dumper is None:
        dumper = json.dumps
    rv = dumper(obj, **kwargs) \
        .replace(u'<', u'\\u003c') \
        .replace(u'>', u'\\u003e') \
        .replace(u'&', u'\\u0026') \
        .replace(u"'", u'\\u0027')
    return Markup(rv)

(You could use \x3C instead of \u003C and that would work in a script element because it's valid JavaScript. But might as well stick to valid JSON.)
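
To illustrate the difference, here is a small sketch with Python's standard `json` parser, which accepts the `\u003C` form but rejects `\x3C` (a JavaScript engine would accept both inside a string literal). The variable name is only for this example:

```python
import json

# \u003C is a JSON escape, so a strict JSON parser decodes it to '<'.
assert json.loads(r'"\u003C/script>"') == '</script>'

# \x3C is a JavaScript-only escape; Python's json module refuses it.
try:
    json.loads(r'"\x3C/script>"')
    x_escape_is_valid_json = True
except json.JSONDecodeError:
    x_escape_is_valid_json = False

assert x_escape_is_valid_json is False
```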

Bob Stein
  • @hyperknot Now I see your comment where you linked to this same routine years ago. Man, I wish I'd seen that earlier. The odyssey I went through to find it. Oh well, it's actually reassuring. I'll let this answer stand, as `.replace('<', r'\x3C')` is I think a handy answer to your question. – Bob Stein Sep 05 '19 at 12:17
1

First of all, your paranoia is well founded.

  • an HTML-parser could be tricked by a closing script tag (better assume by any closing tag)
  • a JS-parser could be tricked by backslashes and quotes (with a really bad encoder)

Yes, it would be much "safer" to encode all characters that could confuse the different parsers involved. Keeping it human-readable might be contradicting your security paradigm.

Note: The result of JSON string encoding should be canonical and, of course, not broken, i.e. parsable. JSON is a subset of JS and should thus be parsable as JS without any risk. So all you have to do is make sure that the HTML parser that extracts the JS code is not tricked by your user data.

So the real pitfall is the nesting of both parsers. Actually, I would urge you to put something like that into a separate request. That way you would avoid that scenario completely.

Given all the parsing styles and error corrections that such a parser may apply, other tags (opening or closing) might achieve a similar feat, that is, suggest to the parser that the script tag has ended implicitly.

So it is advisable to encode the slash and all tag braces (/, <, >), not just the closing of a script tag, using whatever reversible method you choose, as long as it does not confuse the HTML parser:

  • Best choice would be base64 (but you want more readable)
  • HTMLentities will do, although confusing humans :)
  • Doing your own escaping will work as well, just escape the individual characters rather than the </script fragment
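
The last point can be illustrated with a short Python sketch. HTML matches end tags case-insensitively, so a case-sensitive replacement of the `</script` fragment misses `</SCRIPT>`, while escaping the characters themselves does not care how the tag is spelled. Both helper names are hypothetical, and the exact character set to escape is a judgment call (here `<`, `>` and `&`):

```python
import json


def escape_fragment(json_text):
    # Method 1 from the question: only the exact lowercase fragment is touched.
    return json_text.replace('</script', r'<\/script')


def escape_characters(json_text):
    # Escape the characters themselves as JSON \uXXXX sequences.
    return (json_text
            .replace('<', r'\u003c')
            .replace('>', r'\u003e')
            .replace('&', r'\u0026'))


payload = json.dumps({'test': 'x</SCRIPT><b>y</b>'})

fragment_result = escape_fragment(payload)
character_result = escape_characters(payload)

# The uppercase end tag slips through the fragment replacement...
assert '</script' in fragment_result.lower()
# ...but not through the per-character escaping, and the data still round-trips.
assert '<' not in character_result
assert json.loads(character_result) == {'test': 'x</SCRIPT><b>y</b>'}
```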

In conclusion, yes, it's probably best with a few changes, but please note that you will be one step away from "safe" already, by trying something like this in the first place, instead of loading the JSON via XHR or at least using a rigorous string encoding like base64.
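
For the base64 route mentioned above, a minimal server-side sketch in Python could look like this. The function name is hypothetical, and the client would need a matching decode step (for example `atob` followed by UTF-8 decoding and `JSON.parse`), which is not shown here:

```python
import base64
import json


def json_to_base64(data):
    """Encode data as JSON, then as base64 text."""
    raw = json.dumps(data, ensure_ascii=False).encode('utf-8')
    return base64.b64encode(raw).decode('ascii')


data = {'test': 'asdas</script><b>as\'da</b><b>as"da</b>'}
encoded = json_to_base64(data)

# The standard base64 alphabet is A-Z, a-z, 0-9, '+', '/' and '=',
# so the output cannot contain '<', '>', '&' or quotes.
assert not set(encoded) & set('<>&\'"')

# Decoding reverses both steps and returns the original data.
decoded = json.loads(base64.b64decode(encoded).decode('utf-8'))
assert decoded == data
```

The trade-off is the one the question already names: the embedded value is larger and no longer human-readable.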

P.S.: If you can learn from other people's code encoding the strings that's nice, but you should not resort to "libraries" or other people's functions if they don't do exactly what you need. So rather write and thoroughly test your own (de/en)coder and know that this pitfall has been sealed.

KenHBS
anx
  • An HTML parser is not "tricked" by a closing script tag; it is recognizing the end tag within the so-called non-replaceable character data, in the required, documented manner. – Kaz May 29 '21 at 14:41