0

I've got a database with rows that are identified by 32-character hexadecimal GUIDs (stored as binary). I'm wondering how to compress these strings dynamically into a shorter, but still user-friendly representation... ideally for use in shared URLs. Since they're 32 characters hexadecimally (and case-insensitive currently) ... I tried hitting the binary representation with base64 encoding. That got them from 32 characters to 22 characters, but I wasn't sure if there was anything better that was common yet straightforward.

I'm also thinking about getting creative, given that even emoji now is technically URL-safe. Not sure if that's a good idea, though.

Has anyone considered cross-platform solutions for this problem before? Is it better just to generate new IDs entirely with a smaller subset?

Ben Guild
  • 4,881
  • 7
  • 34
  • 60
  • Well, you have a hard lower bound of 16 bytes, since each hex digit describes half a byte. 22 characters doesn't seem materially worse to me than 16, so I'd go with base64. – j_random_hacker Jul 08 '15 at 16:09
  • And using emoji strikes me as an *extremely terrible* idea. (1) People still expect URLs to be or at least look like "plain text", and to be able to copy and paste them from plain text files; (2) You're likely to hit all manner of browser bugs and file encoding bugs; (3) You possibly have other users like me who simply find emoji annoying. – j_random_hacker Jul 08 '15 at 16:17
  • I agree emoji sounds terrible, but I was interested in the idea of getting creative with this. :) – Ben Guild Jul 08 '15 at 16:22
  • 1
    possible duplicate of [shortest encoding for Guid for use in a URL](http://stackoverflow.com/questions/1278934/shortest-encoding-for-guid-for-use-in-a-url) – usr Jul 08 '15 at 17:11
  • Emoji would make things much worse, since they really take four bytes each in UTF-8. – Mark Adler Jul 08 '15 at 23:03

2 Answers2

1

You are allowed to use 0-9, a-z, A-Z, and !$'()*+,-._~ in a URI (which does not include the characters with special syntax interpretations). That's 74 characters. That is a little better than 64. You can use a simple scheme to pull 6 or 7 bits from your stream of bits and use that to select one of the allowed URI characters.

To encode, pull six bits from your stream. If it is less than 54, then emit the corresponding character in the set of 74. If it is 54 or more, pull one more bit on the bottom of that. You now have a seven bit number in the range of 108..127. Subtract 108 and add 54 to get the range 54..73. Emit that character from the set.

You now have an average number of bits per character of 6*54/74 + 7*20/74 = 6.27. Or 1.276 characters per byte. Your 16-byte ID would then be encoded, on average, in 20.4 characters. Actually a bit more since you will have to stuff a few zero bits at the end to get the last character out. The real-world average is 21.1303, with a minimum of 19 and a maximum of 22.

This is faster and simpler than trying to do a base conversion with large integers, and gives essentially the same performance, 21 characters.

Do your 16-byte IDs tend to have leading or trailing zeros, or other patterns amendable to compression? If so, then you can arrange the encoding scheme to use fewer characters for those cases.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
0

See this Javascript implementation:

function toDigits(n, b){
    var digits = []
    while(n.isPositive()){
        digits.push(n.remainder(b).valueOf())
        n  = n.quotient(b);
    }
    return digits
}
function fromDigits(digits, b){
    n = BigInteger(0);
    for(var i=0;i<digits.length;i++){
        var d=parseInt(digits[i],b);
        n = n.multiply(b).add(d);
    }
    return n;
}
function changebase(n,from_base,to_base){
    var temp=fromDigits(n,from_base);
    return toDigits(temp,to_base);
}
var unreserved_characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~";
var number_of_unreserved_characters=unreserved_characters.length;

var guid="9ec54806c242982ca059661b6db74ab9";
var newbase=changebase(guid,16,number_of_unreserved_characters);
var newurl="";
for(var i=0;i<newbase.length;i++){
    newurl+=unreserved_characters[newbase[i]];
}

I used a BigInteger library http://silentmatt.com/biginteger/.

This implementation converts the hex into a new base that is the number of unreserved characters allowed in URI. This might be a little better than base64 since it hase 2 additional characters for a total of 66 characters compared to 64 in base64. That might not make much difference though. So depending on whether you dont mind browser compatibility, you can add other ascii characters to the list.

for instance using:

var unreserved_characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø׃áíóúñѪº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐└┴┬├─┼ãÃ╚╔╩╦╠═╬¤ðÐÊËÈıÍÎÏ┘┌█▄¦Ì▀ÓßÔÒõÕµþÞÚÛÙýݯ´≡±‗¾¶§÷¸°¨·¹³²■";

has much more characters and reduces the size even more, and might work on your target browsers.

TK Omble
  • 369
  • 3
  • 8