7

I am looking for the URL encoding method that is most efficient in terms of space. Raw binary (base2) could be represented in base16, which is smaller and URL-safe, but base64 is even more efficient. However, the usual base64 encoding isn't URL-safe.

So what is the smallest encoding method that is also safe for URLs?

rook

3 Answers

4

This is what the URL-safe Base64 variant ("base64url", RFC 4648 section 5) is for.

It uses the same standard Base64 Alphabet except that + is changed to - and / is changed to _.

Most modern Base64 implementations will support this alternate encoding. If yours doesn't, it's usually just a matter of doing a search/replace on the Base64 input prior to decoding, or on the output prior to sending it to a browser.
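As a rough illustration, here is how that looks with Python's standard `base64` module. The helper names `b64url_encode` and `b64url_decode` are my own, and stripping the `=` padding is an optional extra step that is not part of the answer above; it only works if the decoder restores the padding, as this sketch does.

```python
import base64


def b64url_encode(data: bytes) -> str:
    """Encode with the URL-safe alphabet ('-' and '_' instead of '+' and '/')."""
    encoded = base64.urlsafe_b64encode(data)
    # Optional: drop '=' padding to save a few characters in the URL.
    return encoded.rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode URL-safe Base64, restoring any padding that was stripped."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def b64url_via_replace(data: bytes) -> str:
    """The search/replace fallback for libraries without a URL-safe mode."""
    standard = base64.b64encode(data).decode("ascii")
    return standard.replace("+", "-").replace("/", "_").rstrip("=")
```

The bytes `fb ff fe` are a handy test input because their standard Base64 form consists only of `+` and `/`, so both substitutions are exercised.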

Jonathan Hall
2

"base66" (theoretical, according to spec)

As far as I can tell, the optimal encoding for URLs is a "base66" encoding into the following alphabet:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789-_.~

These are all the "Unreserved characters" according to the URI specification RFC 3986 (section 2.3), so they will appear as-is in the URL. Using this "base66" encoding could give a URL like:

https://example.org/articles/.3Ja~jkWe
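A minimal sketch of such an encoder in Python, treating the input bytes as one big integer and converting it to base 66. The function names and the `0x01` sentinel byte (used so that leading zero bytes survive the round trip) are my own choices for illustration, not part of any standard; 66 is not a power of two, so there is no neat bit-grouping scheme like Base64's.

```python
ALPHABET66 = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_.~"
)


def encode66(data: bytes, alphabet: str = ALPHABET66) -> str:
    """Encode bytes as a big integer written in base len(alphabet)."""
    base = len(alphabet)
    # Assumption: prefix a 0x01 byte so leading zero bytes are not lost.
    n = int.from_bytes(b"\x01" + data, "big")
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))


def decode66(text: str, alphabet: str = ALPHABET66) -> bytes:
    """Inverse of encode66; raises KeyError on characters outside the alphabet."""
    base = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = 0
    for ch in text:
        n = n * base + index[ch]
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the 0x01 sentinel
```

One thing to watch with this alphabet: a path segment consisting only of `.` or `..` has special meaning in URL resolution, so you would probably want to avoid emitting those two strings on their own.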

The question then is whether you want . and ~ in your URLs.

On some older servers (ancient by now, I guess) ~joe would mean the "www directory" of the user joe on this server, so a user might be confused as to what the ~ character is doing in the middle of your URL. This convention is still common on academic websites, especially those of CS professors (e.g. Donald Knuth's website https://www-cs-faculty.stanford.edu/~knuth/).

"base80" (in practice, but not battle-tested)

However, in my own testing the following 14 other symbols also do not get percent-encoded (in Chrome 95 and Firefox 93):

!$'()*+,:;=@[]

(see also this StackOverflow answer)

This leaves a "base80" URL encoding possible. Some of these symbols (notably + and =) would not work in the query string part of the URL, only in the path part. All in all, this ends up giving you beautiful, hyper-compressed URLs like:

https://example.org/articles/1OWG,HmpkySCbBy@RG6_,
https://example.org/articles/21Cq-b6Ud)txMEW$,hc4K
https://example.org/articles/:3Tx**U9X'd;tl~rR]q+
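To put the space savings in perspective, here is a small Python sketch that computes how many characters each base needs for a payload of a given size. The function name `chars_needed` and the 16-byte (128-bit, e.g. a UUID) payload are just illustrative assumptions on my part.

```python
import math


def chars_needed(num_bytes: int, base: int) -> int:
    """Smallest k such that base**k can represent every num_bytes-long value.

    Uses exact integer arithmetic rather than floating-point logarithms.
    """
    total = 256 ** num_bytes
    k = 0
    capacity = 1
    while capacity < total:
        capacity *= base
        k += 1
    return k


if __name__ == "__main__":
    for base in (16, 62, 64, 66, 80):
        bits_per_char = math.log2(base)
        print(f"base{base}: {bits_per_char:.3f} bits/char, "
              f"{chars_needed(16, base)} chars for 16 bytes")
```

For a 16-byte payload this works out to 32 characters in base16, 22 in base62, base64 and base66 alike, and 21 in base80. The gain from going beyond base64 is real but small, so for short payloads it may not be worth the awkward symbols.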

There's a plethora of reasons why you might not want all of those symbols in your URLs. One example is that StackOverflow's own "linkifier" won't include that ending comma in the link it generates (I've manually made it a part of the link here).

Also, the percent-encoding behavior seems to be quite finicky. In some cases Firefox would initially percent-encode ' and ~] but on later requests would not.

qff
1

You can use a 62-character alphabet (base62: digits plus upper- and lower-case letters) instead of the usual base64. This will give you URLs like the YouTube ones: http://www.youtube.com/watch?v=0JD55e5h5JM

You can use the PHP functions provided on this page if you need to map strings to a numerical database ID:

http://bsd-noobz.com/blog/how-to-create-url-shortening-service-using-simple-php

Or this one if you need to directly convert a numerical ID to a short URL string: http://kevin.vanzonneveld.net/techblog/article/create_short_ids_with_php_like_youtube_or_tinyurl/
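In case those links go away, the numeric-ID case is short enough to sketch directly. This is a Python illustration rather than the PHP from the linked pages; the function names and the ordering of the alphabet (digits, then upper case, then lower case) are my own assumptions, and any fixed ordering works as long as encoder and decoder agree.

```python
ALPHABET62 = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
)


def id_to_short(n: int) -> str:
    """Convert a non-negative numeric ID to a base62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET62[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET62[rem])
    return "".join(reversed(digits))


def short_to_id(text: str) -> int:
    """Convert a base62 string back to the numeric ID."""
    n = 0
    for ch in text:
        # str.index raises ValueError for characters outside the alphabet.
        n = n * 62 + ALPHABET62.index(ch)
    return n
```

With this alphabet, IDs below 62 map to a single character and the string only grows by one character for every factor of 62.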

Tchoupi