So we have some characters that are in 'grey' zone and can be but don't have to be encoded.
All characters can be encoded. `http://stackoverflow.com/questions` and `http://stackoverflow.com/%71%75%65%73%74%69%6F%6E%73` are equivalent.
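A quick way to convince yourself of this is to percent-decode the second form. The sketch below uses Python's standard `urllib.parse.unquote`; the choice of Python is only for illustration.

```python
from urllib.parse import unquote

plain = "http://stackoverflow.com/questions"
encoded = "http://stackoverflow.com/%71%75%65%73%74%69%6F%6E%73"

# Percent-decoding the fully encoded path gives back the plain spelling.
decoded = unquote(encoded)
print(decoded == plain)
```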
The only time a character cannot be encoded is when it is being used in a way that has a special meaning in URIs, such as the `/` separating path elements.
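To see why, here is a small sketch (the path `docs/readme` is made up for the example) showing that encoding the `/` turns two path segments into one segment that merely contains a slash:

```python
from urllib.parse import quote, unquote

# "/" used as a delimiter: two path segments.
path_two_segments = "/docs/readme"

# "/" encoded as data: safe="" tells quote() not to leave "/" alone.
path_one_segment = "/" + quote("docs/readme", safe="")

# Split on the literal delimiter first, then decode each segment.
segments_a = [unquote(s) for s in path_two_segments.split("/")[1:]]
segments_b = [unquote(s) for s in path_one_segment.split("/")[1:]]

print(path_one_segment)
print(segments_a)
print(segments_b)
```

The two paths are not interchangeable, which is exactly why a delimiter that is acting as a delimiter has to stay unencoded.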
The only time a character must be encoded is if:
- It is one of those special-meaning characters, and not being used with that special meaning.
- It is one of the reserved characters that may have a special meaning in a particular URI scheme or particular place.
- It has a code point above U+007F.
There are exceptions to the last two though.
In the third case, if you use an IRI then you don't encode such characters, which is pretty much the definition of an IRI. You can convert between IRI and URI by doing or undoing that encoding. (Any such characters in the host portion must be Punycode-encoded though, not URI-encoded.)
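As a rough sketch of that conversion, assuming a made-up host `bücher.example` and path `/café`, and ignoring ports, user info, queries and fragments: the host goes through Python's built-in `idna` codec (which implements the older IDNA 2003 rules), and the path is percent-encoded as UTF-8.

```python
from urllib.parse import quote, unquote

# Hypothetical IRI components, chosen only for illustration.
host = "bücher.example"
path = "/café"

# Host: Punycode via the "idna" codec, not percent-encoding.
ascii_host = host.encode("idna").decode("ascii")

# Path: UTF-8 bytes percent-encoded; "/" stays as a delimiter.
ascii_path = quote(path, safe="/")

uri = "http://" + ascii_host + ascii_path
print(uri)

# Undoing the path encoding recovers the IRI form of the path.
print(unquote(ascii_path) == path)
```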
In the second case, it's safe to not encode the character if it isn't used as a delimiter in the context in question. So for example, `&` can be left as it is in some URIs, but not in HTTP URIs, where it is often used as a separator for query data. This depends, though, on having particular knowledge of the particular URI scheme. It's also probably just not worth the risk of some other process not realising it's okay.
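The `&` case is easy to demonstrate with HTTP-style query strings. In this sketch the parameter name `name` and its value are invented for the example; `urlencode` and `parse_qs` are from Python's standard library.

```python
from urllib.parse import urlencode, parse_qs

value = "Tom & Jerry"

# Encoded: "&" becomes %26, so it is treated as data.
good = urlencode({"name": value})

# Unencoded: the parser sees "&" as a separator and the value is cut short.
bad = "name=" + value

print(good)
print(parse_qs(good))
print(parse_qs(bad))
```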
`!` is an example of this. RFC 3986 includes the production:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
And so `!` is in the set of characters that may or may not be safe to leave unencoded, depending on the scheme in use.
Generally, if you're writing your own encoding code (such as when writing an `HttpEncoder` implementation) you're probably better off just always encoding `!`, but if you're using an encoder that doesn't encode `!` all the time, that's probably okay too; certainly in HTTP URIs it shouldn't make any difference.
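Both behaviours can be seen with a single encoder. This is only an illustration using Python's `urllib.parse.quote` (not the `HttpEncoder` mentioned above): it encodes `!` by default, and leaves it alone if you add it to the `safe` set. Either way, decoding gives the same string back.

```python
from urllib.parse import quote, unquote

text = "wow!"

# Default: "!" is not in quote()'s safe set, so it becomes %21.
always_encoded = quote(text)

# Opt out: list "!" as safe and it is passed through untouched.
left_alone = quote(text, safe="/!")

print(always_encoded)
print(left_alone)

# Both spellings decode to the same original text.
print(unquote(always_encoded) == unquote(left_alone) == text)
```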