136

We are designing a URL system that will specify application sections as words separated by slashes. Specifically, this is in GWT, so the relevant parts of the URL will be in the hash (which will be interpreted by a controller layer on the client-side):

http://site/gwturl#section1/section2

Some sections may need additional attributes, which we'd like to specify with a :, so that the section parts of the URL are unambiguous. The code would split first on /, then on :, like this:

http://site/gwturl#user:45/comments
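
The two-step split described above could be sketched roughly like this. It is only an illustration: `FragmentParser` and `Section` are hypothetical names, and the fragment string is assumed to have already been read from the URL without the leading `#`.

```java
import java.util.ArrayList;
import java.util.List;

public class FragmentParser {
    /** One section of the fragment: a name plus an optional attribute (null when absent). */
    public static final class Section {
        public final String name;
        public final String attribute;

        Section(String name, String attribute) {
            this.name = name;
            this.attribute = attribute;
        }
    }

    /** Splits first on "/" to get sections, then on the first ":" to get the attribute. */
    public static List<Section> parse(String fragment) {
        List<Section> sections = new ArrayList<>();
        for (String part : fragment.split("/")) {
            if (part.isEmpty()) {
                continue; // ignore empty parts, e.g. from a leading slash
            }
            // Limit of 2 keeps any further colons inside the attribute value.
            String[] pieces = part.split(":", 2);
            sections.add(new Section(pieces[0], pieces.length > 1 ? pieces[1] : null));
        }
        return sections;
    }

    public static void main(String[] args) {
        for (Section s : parse("user:45/comments")) {
            System.out.println(s.name + " -> " + s.attribute);
        }
    }
}
```

Note that this only works if the colon arrives unencoded; a fragment of `user%3A45/comments` would yield a single section named `user%3A45`, which is exactly the concern.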

Of course, we are doing this for url-friendliness, so we'd like to make sure that none of these characters which will hold special meaning will be url-encoded by browsers, or any other system, and end up with a url like this:

http://site/gwturl#user%3A45/comments <--- BAD

Is using the colon in this way safe (by which I mean won't be automatically encoded) for browsers, bookmarking systems, even Javascript or Java code?

Stevoisiak
Nicole
  • Maybe it is a good idea to specify (more clearly) that you use the URLs at client-side only? Since a lot of the answers (as did mine) seem to assume you are going to send the URL to a server using HTTP. – Veger Jan 13 '10 at 00:38
  • Edited to add clarification that use of the fragment is happening on the client-side. – Nicole Jan 13 '10 at 00:52
  • I'm curious: after 10 months, has this url scheme worked for you? I'm considering using the same scheme. – Jonathan Swinney Nov 11 '10 at 16:46
  • 1
    @Jonathan Swinney, Unfortunately I've moved on from this project (and company), although the answers here satisfied me that it is the way to go. If I were to start a new project, I would use this scheme, but I would also be sure to use `#!` to indicate that the pages are stateful - see http://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html (This proposal has been adhered to by heavy AJAX users such as Facebook) – Nicole Nov 11 '10 at 17:15
  • 2
    I just found out that WhatsApp will cut a URL on the first colon, so for example it rendered a google maps URL useless. So yes, it's important to escape it. – Petruza Apr 11 '16 at 19:40
  • Why use a colon in a URL? Is the following URL valid? "../video/:videoId" – vikramvi Oct 12 '20 at 12:16

11 Answers

101

I recently wrote a URL encoder, so this is pretty fresh in my mind.

http://site/gwturl#user:45/comments

All the characters in the fragment part (user:45/comments) are perfectly legal for RFC 3986 URIs.

The relevant parts of the ABNF:

fragment      = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

Apart from these restrictions, the fragment part has no defined structure beyond the one your application gives it. The scheme, http, only says that you don't send this part to the server.
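
As a rough illustration of the grammar quoted above, the `fragment` rule can be turned into a regular expression. This is a sketch rather than a full RFC 3986 validator, and `FragmentSyntax` and `isValidFragment` are hypothetical names chosen for the example.

```java
import java.util.regex.Pattern;

public class FragmentSyntax {
    // fragment = *( pchar / "/" / "?" )
    // The character class lists unreserved, sub-delims, ":" and "@", plus "/" and "?".
    // The alternative branch matches a pct-encoded triplet such as %3A.
    private static final Pattern FRAGMENT = Pattern.compile(
            "(?:[A-Za-z0-9\\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*");

    /** Returns true if the string (without the leading '#') matches the fragment rule. */
    public static boolean isValidFragment(String fragment) {
        return FRAGMENT.matcher(fragment).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidFragment("user:45/comments"));   // true
        System.out.println(isValidFragment("user%3A45/comments")); // true
        System.out.println(isValidFragment("user 45"));            // false
    }
}
```

Both the raw and the percent-encoded colon pass: the grammar says the literal `:` is legal, but it does not stop some other component from choosing to encode it.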


EDIT:

D'oh!

Despite my assertions about the URI spec, irreputable provides the correct answer when he points out that the HTML 4 spec restricts element names/identifiers.

Note that identifier rules are changing in HTML 5. URI restrictions will still apply (at time of writing, there are some unresolved issues around HTML 5's use of URIs).

Community
McDowell
  • I think you are on to something, can you explain this a little further? Not sending this to the server is not an issue, as we are using GWT. I'm just not sure I'm clear on the syntax specified by the section you quoted. – Nicole Jan 13 '10 at 00:19
  • But `:` is a gen-delim, not a sub-delim. – bobince Jan 13 '10 at 00:19
  • 1
    The colon is legal for a pchar, so whether it is in sub-delim or gen-delim is not an issue – Veger Jan 13 '10 at 00:23
  • @bobince - `:` is in `pchar`, which is in `fragment`, so `:` is allowed. @Renesis - Wikipedia has an article on ABNF http://en.wikipedia.org/wiki/ABNF You are basically looking at a list of allowed characters, where `/` means _OR_. I haven't done any GWT programming, so I don't know how it uses the fragment part of URIs. – McDowell Jan 13 '10 at 00:27
  • One last question -- do you have any insight into the real-world application of this specification? Does this mean browsers should/will ignore (skip the encoding of) the `:` in the fragment? – Nicole Jan 13 '10 at 00:43
  • It's important that people realise this is the correct answer; everyone else is saying it isn't valid, but *it is after the '#' symbol, so it is*. – Noon Silk Jan 13 '10 at 01:40
  • @Renesis - I had forgotten about the HTML 4 limitations - see this answer: http://stackoverflow.com/questions/2053132/is-a-colon-safe-for-friendly-url-use/2053640#2053640 – McDowell Jan 17 '10 at 21:30
  • This is an excellent answer. I upvoted it, but still wanted to stop in and let you know how much I like everything about it. – Joshua Cheek May 16 '21 at 01:58
93

MediaWiki and other wiki engines use colons in their URLs to designate namespaces, with apparently no major problems.

e.g. http://en.wikipedia.org/wiki/Template:Welcome

Paul Wray
  • 59
    Most relevant answer. We all know that what's in the specs has little to do with reality in web development. You're not going to get a much better guarantee of "safety" than "one of the top 10 websites in the world does it". – Steven Collins Dec 14 '14 at 21:48
  • 5
    @StevenCollins No more relevant than the answer given 3-years prior to this one that states exactly the same thing :) – Martin James Apr 12 '19 at 22:21
69

In addition to McDowell's analysis of the URI standard, remember also that the fragment must be a valid HTML anchor name. According to http://www.w3.org/TR/html4/types.html#type-name:

ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").

So you are in luck: ":" is explicitly allowed. And nobody should "%"-escape it, not only because "%" is an illegal character there, but also because the fragment must match the anchor name character by character, so no agent should tamper with it in any way.

However, you have to test it. Web standards are not strictly followed, and sometimes they conflict. For example, HTTP/1.1 (RFC 2616) does not allow a query string in the request URL, while HTML constructs one when submitting a form with the GET method. Whatever is implemented in the real world wins at the end of the day.

djvg
irreputable
10

I wouldn't count on it. It'll likely get url encoded as %3A by many user-agents.

Asaph
  • 1
    @arbales: Yes. Some less compliant user-agents will leave non-compliant urls unadorned. – Asaph Jan 12 '10 at 23:11
5

From URLEncoder javadoc:

For more information about HTML form encoding, consult the HTML specification.

When encoding a String, the following rules apply:

  • The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
  • The special characters ".", "-", "*", and "_" remain the same.
  • The space character " " is converted into a plus sign "+".
  • All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.

That is, : is not safe.
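
Keep in mind that `URLEncoder` implements HTML form encoding, not general URI encoding, so its behavior says little about what happens to a fragment. The sketch below contrasts it with the multi-argument `java.net.URI` constructor, which quotes only characters that are illegal in each component. The class and method names are hypothetical; the host and path are taken from the question's example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

public class ColonEncodingDemo {
    /** Form-encodes a value the way URLEncoder does; ":" and "/" are both escaped. */
    public static String formEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Builds http://site/gwturl#fragment, letting URI quote only illegal characters. */
    public static String buildUrl(String fragment) {
        try {
            return new URI("http", "site", "/gwturl", fragment).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("user:45/comments")); // user%3A45%2Fcomments
        System.out.println(buildUrl("user:45/comments"));   // http://site/gwturl#user:45/comments
    }
}
```

So in Java code the outcome depends on which API builds the URL: form encoding escapes the colon, while assembling the URI from components leaves it alone.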

axtavt
5

Google also uses colons.

In this specification, they use colons for the custom method names.

Pang
Sabfir
4

I don't see Firefox or IE8 encoding some of the Wikipedia URLs that include the character.

kprobst
  • 1
    Opera also keeps the colon, but counting on such behavior is not a good thing to do – Veger Jan 12 '10 at 23:13
  • 1
    Renesis is talking about the URL fragment and not the URL path. – Gumbo Jan 12 '10 at 23:16
  • Wikipedia was one of my thoughts when writing this question. Is its use of colons technically invalid/unsafe then? I commonly see ( and ) in Wikipedia URLs encoded, but never the colon, which left me a bit confused. – Nicole Jan 12 '10 at 23:23
  • 3
    The Wayback Machine has a : in many of its links - e.g. http://web.archive.org/web/20080822150704/http://stackoverflow.com/ – barrowc Jan 13 '10 at 01:04
2

Colons are used as the split between username and password if a protocol requires authentication.

JP Silvashy
0

Colon isn't safe. See here

Bob
  • That page doesn't motivate why they're not safe. The referenced [RFC2396](http://www.rfc-editor.org/rfc/rfc2396.txt) does not say it should be escaped either. Also, the converter script provided does not encode it (in Chrome 9 anyway). – Adam Lindberg Jan 12 '11 at 14:37
  • Adam you are incorrect. It directly states what and why. – ktamlyn Jun 08 '18 at 14:40
  • Explanation from the article of _why_ colons should be escaped. Seems to be a style argument. > `URLs use some characters for special use in defining their syntax. When these characters are not used in their special role inside a URL, they need to be encoded.` – grmdgs Nov 18 '21 at 22:13
0

Apache URIBuilder and JAX-RS UriBuilder treat : differently (they also treat curly braces differently).

new URIBuilder("http://localhost").setCustomQuery("foo=a:b&bar={}").buildString()

outputs

http://localhost?foo=a:b&bar=%7B%7D

UriBuilder.fromPath("http://localhost").queryParam("foo", "a:b").queryParam("bar", "{}").toTemplate()

outputs

http://localhost?foo=a%3Ab&bar={}

So Apache URIBuilder does not seem to encode : but does encode {}, while for JAX-RS UriBuilder it is the other way around.

Koray Tugay
-5

It is not a safe character: when it appears right after the domain name, it is used to indicate which port you connect to.

RHicke