Is the ampersand the only character that should be encoded in an HTML attribute?

It's well known that this won't pass validation:

<a href="http://domain.com/search?q=whatever&lang=en"></a>

That's because the ampersand should be written as &amp;. Here's a direct link to the validation failure.

This guy lists a bunch of characters that should be encoded, but he's wrong. If you encode the first "/" in http:// the href won't work.

In ASP.NET, is there a helper method already built to handle this? Stuff like Server.UrlEncode and HtmlEncode obviously don't work - those are for different purposes.

I can build my own simple extension method (like .ToAttributeView()) which does a simple string replace.

Deduplicator
sohtimsso1970
  • URL encoding does not apply to **the entire URL**. And `&amp;` is an HTML entity, not a URL-encoded string. You seem to be getting the two mixed up. – BoltClock Sep 17 '11 at 16:50
  • No, I get that that's an entity reference. That's my point - what other characters are encoded? What if there's a space in the URL path? I'm looking for a definitive guide about this. Most reference guides don't seem to cover this really well. They only talk about _total_ URL encoding. – sohtimsso1970 Sep 17 '11 at 16:55
  • Afaik, you don't have to encode any characters in attribute values, as they are not interpreted as HTML anyway. As @BoltClock says, maybe you are talking about URL encoding? But that has nothing to do with HTML entities or HTML at all. [Wikipedia](https://secure.wikimedia.org/wikipedia/en/wiki/Url_encoding) helps you in this case. – Felix Kling Sep 17 '11 at 16:55
  • @Felix All the HTML validators will scream if you leave naked ampersands in attributes. [Common mistakes in HTML](http://htmlhelp.com/tools/validator/problems.html#amp) – sohtimsso1970 Sep 17 '11 at 16:58
  • Yeah true, seems to be the case... – Felix Kling Sep 17 '11 at 17:00
  • @sohtimsso1970 Yes but what does that have to do with URL encoding, which is shown in the link in your question? – BoltClock Sep 17 '11 at 17:00
  • @sohtimsso1970 Validators have to be rewritten to use HTML5 parsing rules: `&` may or may not start a character reference. An unescaped '&' is perfectly valid in attribute values in HTML5 and de facto in HTML4 (all browsers use the current HTML5 rule). – c-smile Sep 17 '11 at 17:57
  • @c-smile - did you check the example sohtimsso1970 gave in the HTML5 validator [validator.nu](http://validator.nu/?doc=http%3A%2F%2Ffiddle.jshell.net%2FRFLR5%2Fshow%2F&showsource=yes)? You'll see that it is indeed invalid in HTML5. – Alohci Sep 17 '11 at 18:05
  • @Alohci - philosophical question: who will validate validators? Validators are not free from errors as any other applications. – c-smile Sep 17 '11 at 18:53

5 Answers

Other than standard URI encoding of the values, & is the only character related to HTML entities that you have to worry about, simply because it is the character that begins every HTML entity. Take, for example, the following URL:

http://query.com/?q=foo&lt=bar&gt=baz

Even though there aren't trailing semi-colons, since &lt; is the entity for < and &gt; is the entity for >, some old browsers would translate this URL to:

http://query.com/?q=foo<=bar>=baz

So you need to specify & as &amp; to prevent this from occurring for links within an HTML parsed document.
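A rough way to see this effect outside a browser is the following Python sketch. It is not ASP.NET, and it relies on the assumption that the standard-library html.unescape, which accepts semicolon-less legacy entity names, is a fair stand-in for the lenient decoding those old browsers did (it does not model any attribute-specific parsing rules):

```python
import html

raw_url = "http://query.com/?q=foo&lt=bar&gt=baz"

# Lenient entity decoding of the unescaped URL: "&lt" and "&gt" are treated
# as entities even without a trailing semicolon, so the URL gets mangled.
mangled = html.unescape(raw_url)

# Writing every "&" as "&amp;" in the markup avoids the problem.
escaped = html.escape(raw_url, quote=True)

# Decoding the escaped form gives back the URL that was intended.
restored = html.unescape(escaped)

print(mangled)
print(escaped)
print(restored)
```

The escaped form is what belongs inside the href attribute; the browser decodes it back to the intended URL before following the link.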

mVChr

The purpose of escaping characters is so that they won't be processed as arguments. So you actually don't want to encode the entire url, just the values you are passing via the querystring. For example:

http://example.com/?parameter1=<ENCODED VALUE>&parameter2=<ENCODED VALUE>

The url you showed is actually a perfectly valid url that will pass validation. However, the browser will interpret the & symbols as a break between parameters in the querystring. So your querystring:

?q=whatever&lang=en

Will actually be translated by the recipient as two parameters:

q = "whatever"
lang = "en"

For your url to work you just need to ensure that your values are being encoded:

?q=<ENCODED VALUE>&lang=<ENCODED VALUE>
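As a minimal sketch of encoding only the values (in Python rather than ASP.NET, with a made-up search term chosen because it contains an ampersand), the standard library's urlencode and parse_qs show both directions:

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical parameter values; the first one contains a literal "&".
params = {"q": "fish & chips", "lang": "en"}

# urlencode percent-encodes each value, then joins the pairs with "&".
query = urlencode(params)

# The recipient splits on the unencoded "&" and decodes each value.
received = parse_qs(query)

print(query)
print(received)
```

Only the ampersand inside the value is percent-encoded; the ampersand separating the two parameters stays literal, which is why the recipient still sees two parameters.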

Edit: The common problems page from the W3C you linked to is talking about edge cases when urls are rendered in html and the & is followed by text that could be interpreted as an entity reference (&copy for example). Here is a test in jsfiddle showing the url:

http://jsfiddle.net/YjPHA/1/

In Chrome and Firefox the link works correctly, but IE renders &copy as ©, breaking the link. I have to admit I've never had a problem with this in the wild (it only affects entity references that don't require a semicolon, which is a pretty small subset).

To be safe from this bug, you can HTML-encode any URLs you render to the page and you should be fine. If you're using ASP.NET, the HttpUtility.HtmlEncode method should work just fine.
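The two layers can be sketched together as follows. This is Python, with html.escape standing in for HttpUtility.HtmlEncode; build_link is a hypothetical helper name, and the parameter named copy is chosen only to reproduce the &copy case:

```python
import html
from urllib.parse import urlencode

def build_link(base_url, params, text):
    """Hypothetical helper: URL-encode the values, then HTML-encode the URL."""
    # Layer 1: percent-encode the query-string values.
    url = base_url + "?" + urlencode(params)
    # Layer 2: HTML-encode the finished URL before placing it in the attribute.
    return '<a href="{}">{}</a>'.format(html.escape(url, quote=True),
                                        html.escape(text))

link = build_link("http://example.com/", {"a": "1", "copy": "2"}, "Next")
print(link)
```

Because the separator is emitted as &amp;, no parser has a chance to read &copy as the copyright sign.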

Chris Van Opstal
  • Thanks, Chris, for the reply, but I don't think that's true. [Numerous](http://htmlhelp.com/tools/validator/problems.html#amp) [sources](http://mrcoles.com/blog/how-use-amersands-html-encode/) [point](http://christopherschmitt.com/2008/07/30/validated-ampersands-in-html-links/) out that all ampersands are to be encoded in HTML. The W3C validator itself will not validate your content if you don't encode them. – sohtimsso1970 Sep 17 '11 at 17:06

You do not need HTML escaping here:

<a href="http://domain.com/search?q=whatever&lang=en"></a>

According to the HTML5 spec: http://www.w3.org/TR/html5/tokenization.html#character-reference-in-attribute-value-state

&lang= should be parsed as an unrecognized character reference, and the value of the attribute should be used as is: http://domain.com/search?q=whatever&lang=en
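One way to check which names are recognized without a trailing semicolon is Python's copy of the HTML5 entity table. This is only a sketch: decodes_without_semicolon is a hypothetical helper, and the table says nothing about validity, only about what a decoder would match:

```python
import html
from html.entities import html5

def decodes_without_semicolon(name):
    """Hypothetical helper: True if the bare name (no ';') is in the HTML5 table."""
    # Keys in the table include the semicolon unless the bare form is also
    # accepted for legacy reasons (for example "amp", "lt", "copy").
    return name in html5

url = "http://domain.com/search?q=whatever&lang=en"

print(decodes_without_semicolon("lang"))   # only "lang;" is in the table
print(decodes_without_semicolon("copy"))   # the bare legacy form is in the table
print(html.unescape(url) == url)           # "&lang=" is left untouched
```

So &lang= survives decoding unchanged, while a parameter named copy or lt would not survive a lenient decoder.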

For reference, I added a question to the HTML5 WG: http://lists.w3.org/Archives/Public/public-html/2011Sep/0163.html

c-smile
  • Hmm - yes I see your point. While parse errors are not the same as validity errors, I can find nothing in the HTML5 spec to indicate that the example you give is invalid. The HTML5 spec says (non-normatively) `Bill and Ted ` but validator.nu calls this out as an error. I think you should probably raise it as a bug on validator.nu if you haven't already done so. – Alohci Sep 18 '11 at 00:17

In HTML attribute values, if you want ", & and a non-breaking space as a result, you should (as an author who is clear about intent) write &quot;, &amp; and &nbsp; in the markup.

For " though, you don't have to use &quot; if you use single quotes to encase your attribute values.

For HTML text nodes, in addition to the above, if you want < and > as a result, you should use &lt; and &gt;. (I'd even use these in attribute values too.)
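A small Python sketch of these replacements, using the standard-library html.escape with a made-up sample value. Note that html.escape covers &, <, > and both quote characters, but it does not turn a non-breaking space into &nbsp;:

```python
import html

# Made-up attribute value containing &, double quotes and <.
value = 'Tom & "Jerry" <3'

# quote=True also escapes " and ', so the result is safe inside either
# a double-quoted or a single-quoted attribute.
escaped = html.escape(value, quote=True)
attribute = 'title="{}"'.format(escaped)

print(attribute)
```

Decoding the escaped text with html.unescape returns the original value, which is what the browser exposes as the attribute's value.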

For hfnames and hfvalues (and directory names in the path) in URIs, I'd use JavaScript's encodeURIComponent() (on a UTF-8 page when encoding for use on a UTF-8 page).
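For readers who are not in a browser, a rough Python approximation of encodeURIComponent() is urllib.parse.quote with a matching set of safe characters. This is an assumption-laden sketch: encode_component is a hypothetical name, and the safe set is chosen to mirror the characters encodeURIComponent() leaves alone:

```python
from urllib.parse import quote

def encode_component(value):
    """Hypothetical helper approximating JavaScript's encodeURIComponent()."""
    # quote() never encodes letters, digits, "_", ".", "-" or "~"; the safe
    # argument adds the remaining characters encodeURIComponent() keeps.
    return quote(value, safe="!*'()", encoding="utf-8")

print(encode_component("a/b c&d"))
print(encode_component("café"))
```

Slashes, spaces and ampersands inside a single component are percent-encoded, and non-ASCII characters are encoded as their UTF-8 bytes.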

Shadow2531

If I understand the question correctly, I believe this is what you want.

Tyler Crompton