Should I encode URLs as ASCII or as UTF-8? I was under the impression that URLs cannot contain non-ASCII characters, but someone told me they can contain UTF-8, and after searching around I couldn't find out which is true. Does anyone know?
- Possible duplicate of [Unicode characters in URLs](http://stackoverflow.com/questions/2742852/unicode-characters-in-urls) – Rafa Viotti Apr 17 '15 at 03:15
1 Answer
There are two parts to this, but they both amount to "yes".
With IDNA, it is possible to register domain names using the full Unicode repertoire (with a few minor twists to prevent ambiguities and abuse).
The path part is not strictly regulated, but it's possible to encode arbitrary strings in the path. The browser could opt to display a human-readable rendering rather than an encoded path. However, this requires heuristics, as there is no way to specify the character set and encoding of the path.
So, `http://xn--msic-0ra.example/mot%C3%B6rhead` is a (fictional example, not entirely correct) computer-readable encoded URL which could be displayed to the user as `http://müsic.example/motörhead`. The domain name is encoded as `xn--msic-0ra.example` in something called Punycode, and the path contains the label "motörhead" encoded as UTF-8 and then URL encoded (the Unicode code point U+00F6 is represented by the two bytes 0xC3 0xB6 in UTF-8).
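To make the two encodings concrete, here is a minimal Python sketch using only the standard library. It assumes Python's built-in `idna` codec (which implements the older IDNA 2003 rules) is good enough for illustration; the variable names are made up for this example.

```python
from urllib.parse import quote, unquote

host = "müsic.example"
path = "/motörhead"

# The built-in "idna" codec converts each non-ASCII label to its
# Punycode form and prefixes it with "xn--".
ascii_host = host.encode("idna").decode("ascii")

# quote() percent-encodes the UTF-8 bytes of the path by default
# and leaves "/" untouched.
ascii_path = quote(path)

url = "http://" + ascii_host + ascii_path
print(url)  # http://xn--msic-0ra.example/mot%C3%B6rhead

# Going the other way recovers the human-readable form.
display_host = ascii_host.encode("ascii").decode("idna")
display_path = unquote(ascii_path)
print("http://" + display_host + display_path)  # http://müsic.example/motörhead
```

The URL that goes over the wire is pure ASCII in both parts; the Unicode form is only a display convention.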
The path could also be `mot%F6rhead`, which is the same label in Latin-1. In this case, deducing a reasonable human-readable representation would be much harder, but perhaps the context of the surrounding characters could offer enough hints for a good guess.
In isolation, `%F6` could be pretty much anything, and `%C3%B6` could be e.g. UTF-16.
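The ambiguity can be seen by decoding the same percent-encoded bytes under different assumptions. This is only a sketch with Python's `urllib.parse`; the choice of candidate codecs is arbitrary and the variable names are hypothetical.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# The same label percent-encoded from Latin-1 instead of UTF-8.
latin1_path = quote("motörhead", encoding="latin-1")
print(latin1_path)  # mot%F6rhead

# unquote() assumes UTF-8 and substitutes U+FFFD for invalid bytes,
# so the Latin-1 form does not round-trip unless the encoding is known.
wrong_guess = unquote(latin1_path)
right_guess = unquote(latin1_path, encoding="latin-1")
print(repr(wrong_guess))  # 'mot\ufffdrhead'
print(right_guess)        # motörhead

# The bytes behind %C3%B6 are valid in several encodings,
# each giving a different character (or characters).
raw = unquote_to_bytes("%C3%B6")
for codec in ("utf-8", "latin-1", "utf-16-be"):
    print(codec, repr(raw.decode(codec)))
```

Nothing in the URL itself says which of these readings was intended, which is why a browser has to guess.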

- You probably mean "Unicode" when you write "UTF-8". That doesn't fundamentally change my answer, either way. – tripleee Mar 12 '14 at 16:33
- Actually they both amount to "no". Neither domains nor URLs can contain any non-ASCII characters. *However*, there exist ways to encode arbitrary characters as ASCII (percent encoding and punycode)... – deceze Mar 12 '14 at 17:47
- +1 @deceze (-: Well, yes. Canonical URLs do not contain Unicode. But the IDNA effort in particular is very much about defining and enabling a human-friendly semi-canonical representation. – tripleee Mar 12 '14 at 19:12
- Another detail is that URL parameter content can be URL-encoded UTF-8. Before the HTTP request is made, the parameter data is URL-encoded, either as part of the full URL or on its own. As query syntax elements like ? and & are already ASCII, only the parameter data is touched by the encoding. A degree symbol ° is the single byte 0xB0 in Latin-1 (it is not part of ASCII) but becomes C2 B0 in UTF-8 https://stackoverflow.com/a/8732093/4299943. An unencoded single-byte ° will result in ?, and sending only %B0 results in �, unless the server decodes the data as single-byte Windows-1252. https://www.w3schools.com/tags/ref_urlencode.ASP – flodis Oct 27 '21 at 14:44
- @flodis Doesn't that simply reiterate information which is already in the answer? Probably don't use w3schools as your reference anyway. – tripleee Oct 27 '21 at 14:46