I'm trying to figure out the minimum amount of encoding that would protect a site from XSS.

I know for sure I'll need to encode < (&lt;) and > (&gt;) inside of tags, and " (&quot;) and ' (&#39;) inside attributes.

Do I also need to encode & (&amp;)? I was having trouble with double encoding when the user was saving data (because &amp; would become &amp;amp;). Are there any security vulnerabilities or downsides that would arise if I didn't encode the ampersands? This would mean users would be able to input any HTML entities they wanted.

By HTML entities I specifically mean ampersand-prefixed sequences that correspond to entities (like &copy; and &trade;, which render as © and ™).

This question is language-agnostic (except for the HTML part, of course).

Edit: heh, Stack Overflow lets me keep my HTML-encoded entities :) That might be telling.

Paul
  • [`htmlentities`](http://us2.php.net/manual/en/function.htmlentities.php) – Waleed Khan Jan 17 '13 at 15:26
  • also strip tags if you don't want to allow html elements in your text ( http://php.net/manual/en/function.strip-tags.php ) – Vlad Preda Jan 17 '13 at 15:28
  • re your edit: SO is a site aimed at developers. If you couldn't enter HTML into questions and answers, it would render the site virtually useless for a large portion of the users. Be assured that they are sanitising the input thoroughly though - SO is a high-enough profile site that it is bound to be attracting a lot of hacking attempts, so you can be sure they've got all the bases covered. – SDC Jan 17 '13 at 15:36
  • @SDC is there a vulnerability that you know of for HTML entities if the ampersand isn't encoded? They definitely are encoding < and >. – Paul Jan 17 '13 at 16:22
  • @ShyGuy: If the user enters the entity string `&amp;` you should encode it. How are you to know he didn't actually mean `&amp;`? Maybe he's discussing HTML encoding? Anyway, if you don't encode it the same as all the other characters, then when you come to decode it, you'll have a mismatch. Then there could indeed be a vulnerability, if he enters `&lt;script&gt;`: if you haven't fully encoded it all as he entered it, it could be decoded to a `<script>` tag. – SDC Jan 17 '13 at 16:32
  • @SDC I'd never decode it, ever. I want to encode on display. I want to somehow allow those special symbols like TM and (C), but not if it leaves me vulnerable to attack. – Paul Jan 17 '13 at 16:47
  • TM and (C) can be displayed without encoding if you use UTF-8 character set. That's the best way to deal with that kind of thing. Encoding is only really necessary for `<`, `>` and `&`. – SDC Jan 17 '13 at 16:54
  • Relates: [Is there a security risk in leaving ampersands unescaped in user-submitted data?](http://stackoverflow.com/a/11038730/53114) – Gumbo Jan 17 '13 at 22:28
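On the question of a concrete risk from unencoded ampersands, here is one scenario, sketched in Python under stated assumptions. It assumes user text ends up in a URL attribute such as `href` and that the application uses a naive string check to reject `javascript:` links. Both `looks_safe` and `encode_without_amp` are hypothetical helpers, and `html.unescape` only approximates how a browser decodes character references in an attribute value.

```python
import html


def looks_safe(url: str) -> bool:
    """Hypothetical naive filter: rejects only a literal 'javascript:' prefix."""
    return not url.strip().lower().startswith("javascript:")


def encode_without_amp(text: str) -> str:
    """Hypothetical 'minimal' encoder that leaves ampersands alone."""
    return (
        text.replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
        .replace("'", "&#39;")
    )


# &#58; is the numeric character reference for ':'
payload = "javascript&#58;alert(1)"
passes_filter = looks_safe(payload)  # True: the filter sees no 'javascript:'

# Without ampersand encoding the reference survives into the attribute,
# and decoding the attribute value turns it back into a colon.
href_partial = encode_without_amp(payload)
seen_partial = html.unescape(href_partial)  # "javascript:alert(1)"

# With full encoding the reference is neutralised into literal text.
href_full = html.escape(payload, quote=True)  # "javascript&amp;#58;alert(1)"
seen_full = html.unescape(href_full)  # "javascript&#58;alert(1)"
```

Encoding the ampersand only closes this particular bypass. A plain `javascript:alert(1)` contains none of the five special characters, so URL attributes generally need a scheme allowlist on top of HTML encoding.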

1 Answer

You only need to encode these characters when you display them on a page, and & needs to be escaped just as much as < and > because it is the character that starts an escape sequence.
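A minimal Python sketch of why the ampersand belongs in the set, assuming the standard library's `html.escape` as the encoder. `encode_without_amp` is a hypothetical helper that mimics the "skip the ampersand" idea from the question, and `html.unescape` stands in, roughly, for the decoding a browser applies to text content.

```python
import html


def encode_without_amp(text: str) -> str:
    """Hypothetical encoder that handles < > " ' but not &."""
    return (
        text.replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
        .replace("'", "&#39;")
    )


# The user is literally writing about HTML encoding.
user_input = "Use &lt; to write a less-than sign"

full = html.escape(user_input, quote=True)
partial = encode_without_amp(user_input)

# Approximate what the reader ends up seeing on the page.
shown_full = html.unescape(full)        # identical to what the user typed
shown_partial = html.unescape(partial)  # "Use < to write a less-than sign"
```

With the ampersand encoded, the page shows exactly what was typed. Without it, the displayed text silently differs from the stored text, which is a correctness problem even in places where it is not directly exploitable.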

If you're having trouble with double encoding of & signs, it sounds like you're encoding before you insert the data into your storage mechanism (a database?). Stop that. You should escape the data for the page only at the point where it is displayed on the page.
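A rough sketch of that rule in Python, with a plain dict as a hypothetical stand-in for the database and `save` / `render` as illustrative names rather than any real API. The raw text is stored verbatim and `html.escape` is applied exactly once, at render time.

```python
import html

storage = {}  # hypothetical stand-in for a database table


def save(key: str, raw_text: str) -> None:
    """Store the text exactly as the user entered it."""
    storage[key] = raw_text


def render(key: str) -> str:
    """Encode once, at the moment the value is written into HTML."""
    return "<p>" + html.escape(storage[key], quote=True) + "</p>"


save("bio", "Tom & Jerry")
first = render("bio")

# Re-saving the stored raw value (for example after an unrelated edit)
# does not change the output, because nothing encoded was ever stored.
save("bio", storage["bio"])
second = render("bio")

# The anti-pattern: encode on save, then encode again on display.
encoded_at_save = html.escape("Tom & Jerry")
double_encoded = html.escape(encoded_at_save)  # "Tom &amp;amp; Jerry"
```

Since the stored value never contains entities produced by the application, there is nothing to double-encode on later saves.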

Ashley Sheridan
  • Yes, I'm trying to figure out what the danger is of allowing the escape sequences. Can you show me an attack that can occur if I don't encode my ampersands? The double encoding happens, for instance, when I'm storing parts of the object to be saved in a data attribute. I encode it on display -> data-name="ShyGuy&copy;". If they edit something else in the object, I'll send data-name back up with the form, which will make it get saved in the DB as "ShyGuy&copy", which next time I output will be "ShyGuy&&copy". – Paul Jan 17 '13 at 15:33
  • A clean way: store chars verbatim as entered by user (SQL escaped if needed; therefore SQL escaping will not be stored). Then, entity-encode everything upon display. Once. That's the tricky part. Alternate: strip out, or prohibit, – OsamaBinLogin Nov 12 '14 at 20:36
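The data-attribute round trip from the comments above can be sketched in Python. This assumes the standard library's `html.parser.HTMLParser` is an acceptable stand-in for a browser reading the attribute back (it replaces character references in attribute values before handing them to `handle_starttag`). `AttrReader` and `read_data_name` are hypothetical helper names.

```python
import html
from html.parser import HTMLParser


class AttrReader(HTMLParser):
    """Collects the decoded attribute values of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        self.attrs.update(attrs)


def read_data_name(markup: str) -> str:
    """Parse the markup and return the decoded data-name value."""
    reader = AttrReader()
    reader.feed(markup)
    reader.close()
    return reader.attrs["data-name"]


raw = "ShyGuy&copy;"  # what the user literally typed

# Fully encoded on output: the parser hands back the original raw text.
encoded = '<div data-name="{}"></div>'.format(html.escape(raw, quote=True))
round_tripped = read_data_name(encoded)

# Ampersand left alone: the reference is decoded and the value changes.
unencoded = '<div data-name="{}"></div>'.format(raw)
changed = read_data_name(unencoded)  # "ShyGuy©"
```

Under these assumptions, a fully encoded attribute reads back as the original raw string, so posting that value and storing it verbatim should not accumulate extra `&amp;` layers.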