
I am having trouble understanding how escaping works inside html tag attribute values that are javascript.

I was led to believe that you should always escape & ' " < > . So for JavaScript as an attribute value I tried:

<a href="javascript:alert(&apos;Hello&apos;);"></a>

It doesn't work. However:

<a href="javascript:alert(&#39;Hello&#39;);"></a>

and

<a href="javascript:alert('Hello');"></a>

both work in all browsers!

Now I am totally confused. If all my attribute values are enclosed in double quotes, does this mean I do not have to escape single quotes? Or are &apos; and ASCII 39 technically different characters, such that JavaScript requires ASCII 39 but not &apos;?

Deduplicator
Myforwik
  • OK, I just found out that `&apos;` is not actually an entity reference in HTML, despite what w3schools says (http://www.w3.org/TR/1998/REC-html40-19980424/sgml/entities.html) – Myforwik Feb 08 '12 at 04:51
  • I think `&apos;` is well-defined since [HTML 5.0](https://www.w3.org/TR/html50/syntax.html#named-character-references). – Franklin Yu Jan 26 '18 at 15:50

2 Answers


There are two types of “escapes” involved here, HTML and JavaScript. When interpreting an HTML document, the HTML escapes are parsed first.

As far as HTML is concerned, the rules within an attribute value are the same as elsewhere plus one additional rule:

  • The less-than character < should be escaped. Usually &lt; is used for this. Technically, depending on HTML version, escaping is not always required, but it has always been good practice.
  • The ampersand & should be escaped. Usually &amp; is used for this. This, too, is not always obligatory, but it is simpler to do it always than to learn and remember when it is required.
  • The character that is used as the delimiter around the attribute value must be escaped inside it. If you use the Ascii quotation mark " as delimiter, it is customary to escape its occurrences using &quot; whereas for the Ascii apostrophe, the entity reference &apos; is defined in some HTML versions only, so it is safest to use the numeric reference &#39; (or &#x27;).

You can escape > (or any other data character) if you like, but it is never needed.
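The three rules above could be sketched roughly like this. `escapeAttributeValue` is a made-up helper name, not a built-in, and the sketch only covers the HTML layer of a quoted attribute value; it is not a complete XSS defense.

```javascript
// Hypothetical helper (not a standard API): escapes text for use inside a
// quoted HTML attribute value, following the rules listed above.
function escapeAttributeValue(text) {
  return String(text)
    .replace(/&/g, "&amp;")   // ampersand first, so later references are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;")  // needed when " is the delimiter
    .replace(/'/g, "&#39;");  // numeric reference, since &apos; is not defined in every HTML version
}

const code = "alert('Hello');";
const html = '<a href="javascript:' + escapeAttributeValue(code) + '">Say hello</a>';
console.log(html);
// <a href="javascript:alert(&#39;Hello&#39;);">Say hello</a>
```

Note that `>` is left alone, which matches the point that escaping it is never needed.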

On the JavaScript side, there are some escape mechanisms (with \) in string literals. But these are a different issue, and not relevant in your case.
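For a case where the JavaScript layer does matter, here is a rough sketch of how the two layers stack when the string itself contains an apostrophe. Both function names are invented for illustration. The sketch ignores other characters that matter inside a JavaScript literal (line breaks, for example) and the percent-decoding that browsers apply to `javascript:` URLs.

```javascript
// Hypothetical helper: the JavaScript layer. Escapes text for use inside a
// single-quoted JavaScript string literal (only backslash and apostrophe).
function escapeForJsString(text) {
  return String(text)
    .replace(/\\/g, "\\\\")  // backslash first, so the next step is not re-escaped
    .replace(/'/g, "\\'");
}

// Hypothetical helper: the HTML layer, for a quoted attribute value.
function escapeForHtmlAttribute(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const message = "It's me";

// Innermost layer first: build the JavaScript source.
const js = "alert('" + escapeForJsString(message) + "');";
console.log(js);   // alert('It\'s me');

// Outermost layer last: put the JavaScript source into the attribute.
const html = '<a href="' + escapeForHtmlAttribute("javascript:" + js) + '">Greet</a>';
console.log(html); // <a href="javascript:alert(&#39;It\&#39;s me&#39;);">Greet</a>
```

The browser undoes these in the opposite order: HTML references are decoded first, and only then does the JavaScript parser see the backslash.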

In your example, on a browser that conforms to current specifications, the JavaScript interpreter sees exactly the same code alert('Hello');. The browser has “unescaped” &apos; or &#39; to '. I was somewhat surprised to hear that &apos; is not universally supported these days, but it’s not an issue: there is seldom any need to escape the Ascii apostrophe in HTML (escaping is only needed within attribute values and only if you use the Ascii apostrophe as its delimiter), and when there is, you can use the &#39; reference.
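To make the "unescaping" step concrete, here is a deliberately tiny decoder that only knows the handful of references discussed here. It is not how a real HTML parser is implemented; the function and table names are invented for the sketch. The second table leaves out `apos`, which is presumably what a parser following the HTML 4 entity list would do.

```javascript
// Hypothetical lookup tables: named references with and without &apos;.
const HTML5_NAMES = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };
const HTML4_NAMES = { amp: "&", lt: "<", gt: ">", quot: '"' };

// Hypothetical, minimal decoder for an attribute value. Unknown named
// references are left untouched.
function decodeCharacterReferences(text, names = HTML5_NAMES) {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, ref) => {
    if (ref[0] === "#") {
      const isHex = ref[1] === "x" || ref[1] === "X";
      const codePoint = parseInt(ref.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return String.fromCodePoint(codePoint);
    }
    return Object.prototype.hasOwnProperty.call(names, ref) ? names[ref] : match;
  });
}

// All three spellings decode to the same JavaScript source.
console.log(decodeCharacterReferences("alert(&#39;Hello&#39;);"));   // alert('Hello');
console.log(decodeCharacterReferences("alert(&#x27;Hello&#x27;);")); // alert('Hello');
console.log(decodeCharacterReferences("alert(&apos;Hello&apos;);")); // alert('Hello');

// Without apos in the table, the reference survives and the result is not valid JavaScript.
console.log(decodeCharacterReferences("alert(&apos;Hello&apos;);", HTML4_NAMES));
// alert(&apos;Hello&apos;);
```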

Pedro A
Jukka K. Korpela
  • Out of curiosity, could you please elaborate on when it's necessary to escape ampersands and when it isn't? – Rakesh Pai Jan 18 '13 at 06:31
  • @RakeshPai, that depends on HTML version. By HTML 4.01 rules, the ampersand must be escaped if immediately followed by an Ascii letter (a–z, A–Z) or if immediately followed by number sign `#` and an Ascii letter. – Jukka K. Korpela Jan 18 '13 at 08:32
  • Interesting. It makes sense since it will conflict with other types of HTML entities. Thanks. – Rakesh Pai Jan 18 '13 at 09:24
  • Assuming you're trying to defeat XSS, this advice is bad. http://wonko.com/post/html-escaping "Escaping &, <, >, ", ', `, , !, @, $, %, (, ), =, +, {, }, [, and ] is almost enough" – avgvstvs Jan 27 '17 at 19:29
  • As the article you are citing states, "All those characters up there (including the space character!) can be used to break out of an _unquoted_ HTML attribute value". While there probably isn't any downside (except performance) to escaping all of these characters, it is much easier to just use quotes around HTML attribute values if you are expanding placeholders in them. – Florian Winter Nov 10 '17 at 11:40
  • @FlorianWinter That should be a reply to @avgvstvs? – Franklin Yu Jan 26 '18 at 16:04
  • Note that `&quot;` is for " and `&apos;` is for ' – Géza Mar 23 '18 at 10:10
  • I'm probably missing something obvious, but why does `<` need to be escaped but not `>`? – ahiijny Oct 26 '21 at 05:20
  • @ahiijny Because the `<` marks the start of the tag. You don't need to also escape the end of the tag if you didn't get into a tag in the first place. – Fighter178 Apr 01 '23 at 05:39
  • @Florian Winter So basically, if the HTML attribute value is quoted, then `text.replace(/"/g, "&quot;");` is enough. Like: `\`Hi\`` – Zoidbergseasharp Apr 25 '23 at 14:49
  • @Zoidbergseasharp No. As the article quoted above (https://wonko.com/post/html-escaping/) demonstrates, it is possible to break out of a quoted HTML attribute containing a placeholder without even using double-quotes. Simple Text replacement is often wrong. Context matters. Proper escaping to prevent XSS is more than one line of code and too complex to describe in a single StackOverflow comment. – Florian Winter Apr 27 '23 at 09:43
  • @Florian Winter I read the article and I don't see how you can exploit `Hi`. Notice the attribute value is set in double quotes – Zoidbergseasharp May 09 '23 at 14:03

&apos; is not a valid entity reference in HTML 4 (it is defined in XML, XHTML and HTML5). To be safe, escape the apostrophe using &#39;

BryanH
Myforwik