
I see that many sites (Amazon, Wikipedia, others) use UTF-8-encoded, URL-escaped Unicode in their URLs, and those URLs are prettified by (at least) Chrome.

For example, we would represent http://ja.wikipedia.org/wiki/メインページ as http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 when writing our HTTP headers, and Chrome and Firefox seem to understand this gracefully. (I didn't test on IE.)

Is there a governing standard for this behavior? Or is it strictly a de facto standard? Or is it completely non-standard?

I'd really like to see a link to the defining paragraph of some RFC.

bukzor
    In _what_ standards? Please use links in your question. – Oded Feb 02 '12 at 15:51
    Seriously though, without knowing _what_ standards you mean, how can anyone answer this? – Oded Feb 02 '12 at 16:24
    What is an international URL? – Gumbo Feb 02 '12 at 16:32
    Possible duplicates: [*What is the proper way to URL encode Unicode characters?*](http://stackoverflow.com/q/912811/53114), [*Unicode characters in URLs*](http://stackoverflow.com/q/2742852/53114) – Gumbo Feb 02 '12 at 16:38
  • @bukzor So you mean URLs with non-ASCII characters, right? – Gumbo Feb 02 '12 at 16:39
  • @Gumbo: Those are similar, and helpful, but not duplicate (imo). They discuss *how* to do unicode URLs, but not *why*. – bukzor Feb 02 '12 at 16:59
  • @Oded: This question is looking for a standard. It's part of the answer, not part of the question. – bukzor Jul 10 '12 at 20:19

2 Answers

1

The URI standard, [RFC 3986, §2.5](http://tools.ietf.org/html/rfc3986#section-2.5), says:

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.

That seems pretty definitive.
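As a quick illustration (my own sketch, not part of the standard), Python's standard-library `urllib.parse.quote` implements exactly this rule: it encodes the text as UTF-8 and percent-encodes every octet that doesn't correspond to an unreserved character.

```python
from urllib.parse import quote, unquote

# Percent-encode the UTF-8 octets of the non-ASCII path segment,
# as RFC 3986 describes. quote() uses UTF-8 by default.
encoded = quote("メインページ")
print(encoded)  # %E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8

# Decoding reverses the process.
assert unquote(encoded) == "メインページ"
```

This reproduces the escaped path from the question exactly.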

I'm still unsure when it was ratified, or what the current browser support is.

bukzor
    And other standards depend on this. JavaScript's definition of [`encodeURI`](http://es5.github.com/#x15.1.3.3) and its dual `decodeURI` are defined thus: "each instance of certain characters is replaced by one, two or three escape sequences representing the UTF-8 encoding of the character." – Mike Samuel Feb 02 '12 at 16:53
    The only problem is it says: “When a *new* URI scheme […]” – Gumbo Feb 02 '12 at 17:54
0

RFC 3987 is the newer standard for international URIs/URLs, known as IRIs. The older generic URI standard, RFC 3986, does not itself support Unicode. Anyone not using IRIs yet has to come up with their own way of encoding unsupported characters for their own needs. Percent-encoding UTF-8 octets is one way, but it is certainly not the only way that is actually in use.
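For what it's worth, RFC 3987's mapping from an IRI to a URI (section 3.1) amounts to percent-encoding the UTF-8 octets of the non-ASCII characters in each component. A simplified sketch of that mapping, handling only the path/query/fragment (a full implementation would also convert the host via IDNA, and must not double-encode existing percent-escapes):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri):
    # Simplified sketch of the RFC 3987 section 3.1 mapping:
    # percent-encode the UTF-8 octets of non-ASCII characters
    # in the path, query, and fragment components.
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,           # a full version would apply IDNA here
        quote(parts.path, safe="/"),
        quote(parts.query, safe="=&"),
        quote(parts.fragment),
    ))

print(iri_to_uri("http://ja.wikipedia.org/wiki/メインページ"))
# http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8
```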

Remy Lebeau