
I see that many sites (Amazon, Wikipedia, others) use UTF-8-encoded, URL-escaped Unicode in their URLs, and those URLs are prettified by (at least) Chrome.

For example, we would represent http://ja.wikipedia.org/wiki/メインページ as http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 when writing our HTTP headers, and Chrome and Firefox seem to understand this gracefully. (I didn't test on IE.)

Is there a governing standard for this behavior? Or is it strictly a de facto standard? Or is it completely non-standard?

I'd really like to see a link to the defining paragraph of some RFC.

bukzor
    In _what_ standards? Please use links in your question. – Oded Feb 02 '12 at 15:51
    Seriously though, without knowing _what_ standards you mean, how can anyone answer this? – Oded Feb 02 '12 at 16:24
    What is an international URL? – Gumbo Feb 02 '12 at 16:32
    Possible duplicates: [*What is the proper way to URL encode Unicode characters?*](http://stackoverflow.com/q/912811/53114), [*Unicode characters in URLs*](http://stackoverflow.com/q/2742852/53114) – Gumbo Feb 02 '12 at 16:38
  • @bukzor So you mean URLs with non-ASCII characters, right? – Gumbo Feb 02 '12 at 16:39
  • @Gumbo: Those are similar, and helpful, but not duplicate (imo). They discuss *how* to do unicode URLs, but not *why*. – bukzor Feb 02 '12 at 16:59
  • @Oded: This question is looking for a standard. It's part of the answer, not part of the question. – bukzor Jul 10 '12 at 20:19

2 Answers

1

The URI standard, [RFC 3986, §2.5](http://tools.ietf.org/html/rfc3986#section-2.5), says:

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.

That seems pretty definitive.
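As a quick illustration (my own sketch, not part of the standard), Python's standard-library `urllib.parse.quote` implements exactly this rule: it encodes the text as UTF-8 and percent-encodes every octet that doesn't correspond to an unreserved character.

```python
from urllib.parse import quote, unquote

# Percent-encode the UTF-8 octets of the non-ASCII path segment,
# as RFC 3986 describes. quote() uses UTF-8 by default.
encoded = quote("メインページ")
print(encoded)  # %E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8

# Decoding reverses the process.
assert unquote(encoded) == "メインページ"
```

This reproduces the escaped path from the question exactly.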

I'm still unsure when it was ratified, or what the current browser support is.

bukzor
    And other standards depend on this. JavaScript's definition of [`encodeURI`](http://es5.github.com/#x15.1.3.3) and its dual `decodeURI` are defined thus: "each instance of certain characters is replaced by one, two or three escape sequences representing the UTF-8 encoding of the character." – Mike Samuel Feb 02 '12 at 16:53
    The only problem is it says: “When a *new* URI scheme […]” – Gumbo Feb 02 '12 at 17:54
0

RFC 3987 is the newer standard for international URIs/URLs, known as IRIs. The older generic URI standard, RFC 3986, does not itself support Unicode. Anyone not using IRIs yet has to come up with their own way of encoding unsupported characters for their own needs. Percent-encoding UTF-8 octets is one way, but it is certainly not the only way that is actually in use.
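For what it's worth, RFC 3987's mapping from an IRI to a URI (section 3.1) amounts to percent-encoding the UTF-8 octets of the non-ASCII characters in each component. A simplified sketch of that mapping, handling only the path/query/fragment (a full implementation would also convert the host via IDNA, and must not double-encode existing percent-escapes):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri):
    # Simplified sketch of the RFC 3987 section 3.1 mapping:
    # percent-encode the UTF-8 octets of non-ASCII characters
    # in the path, query, and fragment components.
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,           # a full version would apply IDNA here
        quote(parts.path, safe="/"),
        quote(parts.query, safe="=&"),
        quote(parts.fragment),
    ))

print(iri_to_uri("http://ja.wikipedia.org/wiki/メインページ"))
# http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8
```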

Remy Lebeau