
RFC 1738 specifies the syntax for URLs, and states:

URLs are written only with the graphic printable characters of the
US-ASCII coded character set. The octets 80-FF hexadecimal are not
used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
control characters; these must be encoded.

It does not, however, say what code set these octets then represent.

RFC 2396 seems to try to improve on the situation, but it only says:

For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used.

It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification.

Is there any unambiguous way for a client to determine which character set to use when interpreting encoded octets, or for a server to determine which character set a client used to encode them?

It looks to me like most servers default to UTF-8, but this seems to be a de facto choice more than a specified one.
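To make the ambiguity concrete, here is a minimal Python sketch. The helper name `decode_candidates` and the sample strings are made up for illustration; only `urllib.parse.unquote_to_bytes` comes from the standard library. It shows that the same percent-encoded octets can yield different text, or no valid text at all, depending on which charset the receiver assumes:

```python
from urllib.parse import unquote_to_bytes


def decode_candidates(encoded, charsets=("utf-8", "latin-1")):
    """Decode percent-encoded text under several assumed charsets.

    Hypothetical helper: returns a dict mapping each charset name to the
    decoded string, or None when the octets are invalid in that charset.
    """
    raw = unquote_to_bytes(encoded)  # percent-escapes -> raw octets
    results = {}
    for charset in charsets:
        try:
            results[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            results[charset] = None
    return results


# Octets C3 A9: valid in both charsets, but they mean different things.
print(decode_candidates("caf%C3%A9"))
# Octet E9 alone: a letter in Latin-1, an incomplete sequence in UTF-8.
print(decode_candidates("caf%E9"))
```

Nothing in the URL itself tells the receiver which of these readings the sender intended.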

Thomas Vander Stichele

2 Answers


As per your quote, URLs are ASCII. That's all.

URIs OTOH, allow for bigger charsets; usually UTF-8 as you said yourself.

The point to remember is that URLs are a subset of URIs. Therefore, the real question is, which of these is what you write in a browser?

I'd guess you can write a URI, and the browser should try its best to transform it into a URL (which is what HTTP/1.1 supports, AFAICR). For non-ASCII characters, that means percent-encoded hex codes, usually of the UTF-8 octets.
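As a rough sketch of that transformation in Python: `encode_path` is a hypothetical name and the sample path is made up, while `quote` and `unquote` are the standard `urllib.parse` functions. The charset is an explicit choice on the client side, which is why UTF-8 here is a convention rather than something the URL carries:

```python
from urllib.parse import quote, unquote


def encode_path(path, charset="utf-8"):
    """Percent-encode a path, keeping '/' as a separator.

    Hypothetical helper: the charset is chosen by the caller; the
    resulting URL does not record which one was used.
    """
    return quote(path, safe="/", encoding=charset)


path = "/wiki/Zürich"

as_utf8 = encode_path(path)                # /wiki/Z%C3%BCrich
as_latin1 = encode_path(path, "latin-1")   # /wiki/Z%FCrich
print(as_utf8, as_latin1)

# Decoding only round-trips if the receiver assumes the same charset.
print(unquote(as_utf8, encoding="utf-8") == path)
```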

pergy
Javier
    URLs are opaque identifiers with no character encoding. The identifier can be considered a binary string that only has meaning to the target host it is intended for. The target host can, if it so wishes, apply a character-set interpretation to the URL data. This means the client has no control over the meaning or character set, and no way to express a choice, since the interpretation of the URL is 100% a matter for the server. So, to answer the original question: you cannot assume any character set; it is specific to the server implementation, so ask the server administrator. – Darryl Miles Jun 04 '13 at 11:44

I believe the specification you are looking for is RFC 3987, which describes IRIs - Internationalized Resource Identifiers.
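For a feel of what mapping an IRI to a URI involves, here is a simplified Python sketch. It is not a complete implementation of the RFC 3987 mapping: the function name `iri_to_uri` and the sample IRI are made up, userinfo is ignored, and the host is converted with Python's built-in `idna` codec as a stand-in for proper internationalized domain name handling. The main point is that non-ASCII characters in the path, query, and fragment are percent-encoded from their UTF-8 octets:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Simplified IRI-to-URI conversion (illustrative only).

    Assumptions: no userinfo component, and the host can be handled by
    Python's built-in 'idna' codec. Existing %XX escapes are preserved.
    """
    parts = urlsplit(iri)

    # Host: non-ASCII labels become ASCII "xn--" labels.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Other components: non-ASCII characters become UTF-8 percent-escapes.
    path = quote(parts.path, safe="/%", encoding="utf-8")
    query = quote(parts.query, safe="=&%/?+", encoding="utf-8")
    fragment = quote(parts.fragment, safe="%/?", encoding="utf-8")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://bücher.example/straße?q=ñ"))
```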

Jim