2

I have to make a URL shortener for query strings. Have spent few days trying to compress array data into base64 strings. Thinking that the best approach may be to interpret something like "[[1,2,9,3],[1,0,2],[39,4]]" as base13 with numbers 0-9 and [], symbols.

how the current algorithm works: convert the stringified arrays into an array of base13, where each element represents 1 unique character, convert this array to base10 number, convert this number to base 64 string.

but the problem is when converting the base13 array to base10 number, it makes large numbers like 5.304781188371057e+86 which cant be held in js.

I am open to alternative solutions of course, but please do not suggest something like creating a database of URLs as it won't work as I have up to 51!*51! unique URLs, better to just make a compact encodable and decodable query string and decode it as soon as the website is accessed.

//convert stringified array to array of base13(each element = each digit of base13 number)
function stringToArray(string)
{
    let charSet = "[],1234567890";
    let array = [];
    for(let i = 0; i < string.length; i++)
    {
        array.push(charSet.indexOf(string[i]));
    }
    return array;
}

//convert base13 array to one large decimal number
function arrayToDecimal(array, base)
{
    var decimal = 0;
    for(let i = 0; i < array.length; i++)
    {
        decimal += array[i] * Math.pow(base, i)
    }
    return decimal;
}

//convert decimal number back to array
function decimalToArray(decimal, base)
{
    var quotient = decimal;
    var remainder = [];
    while(quotient > base)
    {
        remainder.push(quotient % base)
        quotient = Math.floor(quotient / base);
    }
    remainder.push(quotient % base)
    return remainder;
}

const alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// binary to string lookup table
const b2s = alphabet.split('');

// string to binary lookup table
// 123 == 'z'.charCodeAt(0) + 1
const s2b = new Array(123);
for(let i = 0; i < alphabet.length; i++)
{
    s2b[alphabet.charCodeAt(i)] = i;
}

// number to base64
const ntob = (number) =>
{
    if(number < 0) return `-${ntob(-number)}`;

    let lo = number >>> 0;
    let hi = (number / 4294967296) >>> 0;

    let right = '';
    while(hi > 0)
    {
        right = b2s[0x3f & lo] + right;
        lo >>>= 6;
        lo |= (0x3f & hi) << 26;
        hi >>>= 6;
    }

    let left = '';
    do {
        left = b2s[0x3f & lo] + left;
        lo >>>= 6;
    } while(lo > 0);

    return left + right;
};

// base64 to number
const bton = (base64) =>
{
    let number = 0;
    const sign = base64.charAt(0) === '-' ? 1 : 0;

    for(let i = sign; i < base64.length; i++)
    {
        number = number * 64 + s2b[base64.charCodeAt(i)];
    }

    return sign ? -number : number;
};



console.log(decimalToArray(bton(ntob(arrayToDecimal([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 13))), 13)) 
//encoded and decoded, works output:[1,1,1,1,1,1,1,1,1,1,1,1,1]
console.log(arrayToDecimal([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 13)) 
//encoding doesnt work, array to decimal converts to 5.304781188371057e+86```
Siddharth Agrawal
  • 354
  • 2
  • 3
  • 11

3 Answers3

1

An interesting problem... The first thing you will need to assess is whether the base conversion compression you're seeking is worthwhile. Ie, how many base 64 characters are required to represent n characters of base 13? This involves solving...

13 ** n = 64 ** x

Solving for x, we get...

 x = n * log(13) / log(64)

Ie, for every n digits of base 13, how many digits of base 64 are required. A sampling of a few values of n returns...

  • n = 6, x = 3.70
  • n = 7, x = 4.31
  • n = 8, x = 4.93
  • n = 9, x = 5.55
  • n = 10, x = 6.17
  • n = 11, x = 6.78
  • n = 12, x = 7.40
  • n = 13, x = 8.01
  • n = 14, x = 8.63
  • n = 15, x = 9.25
  • n = 16, x = 9.86

So how to interpret this? If you have 10 digits of base 13, you're going to need 7 digits (6.17 rounded up) of base 64. So the best ratio is when x is equal to, or just under, a whole number. So, 8 digits of base 13 requires 5 digits of base 64, achieving a best case of 5/8 or 62.5% compression ratio.

Assuming that's good enough to meet your requirement, then the following function converts the "base13" string to base 64.

const base13Chars = "0123456789[],";
const base64Chars = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_';  
// see https://en.wikipedia.org/wiki/Query_string for URL parameter allowable characters.

function base13toBase64(x13) {

    base13 = x13.split("").map( c => base13Chars.indexOf(c) );

    // Make the array an even multiple of 8
    for (i = base13.length; i % 8 !==0; i++) {
        base13[i] = 0;
    }

    x64 = "";
    for (i = 0; i < base13.length; i += 8) {
        // Calculate base13 value of the next 8 characters.
        let n = 0;
        for (j = 0; j < 8; j++) {
            n = n * 13 + base13[i + j];
        }
        // Now calculate the base64 of n.
        for (j = 0; j < 5; j++) {
            x64 = x64 + base64Chars.substr(n % 64,1);
            n = Math.floor(n / 64);
        }   
    }

    return x64;
}

Running the above...

 base13toBase64( "[[1,2,9,3],[1,0,2],[39,4]]" ) returns "ilYKerYlgEJ4PxAAjaJi"

Note that the original value is a length of 26 characters, and the base64 value is 20 characters, so the compression ratio is 77%, not quite the optimal 62.5%. This is because of the padding to bring the original array to 32 characters, an even multiple of 8. The longer the string to encode, though, the closer the ratio will be to 62.5%.

Then, on the server side you'll need the constants above plus the following function to "uncompress" the base64 to the base13 stringified URL...

function base64toBase13(x64) {

    base64 = x64.split("").map( c => base64Chars.indexOf(c) );

    x13 = "";
    for (i = 0; i < base64.length; i += 5) {
        // Calculate base64 value of the next 5 characters.
        let n = 0;
        for (j = 5 - 1; 0 <= j; j--) {
            n = n * 64 + base64[i + j];
        }
        // Now calculate the base13 of n.
        let x = "";
        for (j = 0; j < 8; j++) {
            x = base13Chars.substr(n % 13,1) + x;
            n = Math.floor(n / 13);
        }
        x13 = x13 + x;
    }

    // Removed the trailing 0's as a result of the buffering in
    // base13toBase64 to make the array an even multiple of 8.
    while (x13.substr(-1,1) === "0") {
        x13 = x13.substr(0, x13.length - 1);
    }

    return x13;
}

Running the above...

 base64toBase13 ( "ilYKerYlgEJ4PxAAjaJi" ) returns "[[1,2,9,3],[1,0,2],[39,4]]"

Hope this helps...

Trentium
  • 3,419
  • 2
  • 12
  • 19
  • Thanks a bunch! Finally! so let me get this correct, what you did was you converted every 8 characters from base13 string into a decimal number, then converted that into a base64 string? and for base64 back to base13 you went in increments of 5? want to know the exact methodology in case I have to include objects in the future. I also found https://stackoverflow.com/questions/23190056/hex-to-base64-converter-for-javascript which may have been useful in solving this problem too. – Siddharth Agrawal Aug 11 '19 at 07:27
  • That's correct. Eg, 8 digits of base13 represents 13**8 numbers, or 815730721 decimal numbers, specifically numbers 0 thru 815730720 inclusive. Using the same logic, 5 digits of base64 can represent numbers 0 thru 1073741823 inclusive. So the routine converts the base13 to decimal, and then converts the decimal to base64. (BTW, I presume you are aware that if you can perform a POST rather than a GET, there are no restrictions on data size, and you can avoid all this conversion stuff. See https://www.diffen.com/difference/GET-vs-POST-HTTP-Requests ) – Trentium Aug 11 '19 at 13:21
  • not really sure what post and get are, also I don't have any server-side code everything is done locally for my website. again, thanks a bunch for the explanation :) – Siddharth Agrawal Aug 11 '19 at 13:47
1

The best compression is when you can leave stuff out.

assuming your data structure is Array<Array<int>> given by the one sample, we can leave out pretty much everything that doesn't contribute to the data itself.

I'm not compressing the string, but the data itself with 1 b64Character / 5 bits needed to represent a number. as for the structure we only store the number of sub-arrays and their respective lengths; so more or less an additional character per Array in your data.

boils down to:

function encode(data) {
  const alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
  let str = "";

  function encode(nr, hasMoreDigits) {
    if (nr > 31) {
      // I need more bits/characters to encode this number.
      //encode the more significant bits with the 0b100000 flag
      encode(nr >>> 5, 32);
    }

    // 0b011111 payload | 0b100000 flag
    const index = nr & 31 | hasMoreDigits;
    str += alphabet[index];
  }

  encode(data.length);
  data.forEach(arr => {
    encode(arr.length);
    arr.forEach(v => encode(v >>> 0 /* int32 -> uint32 */));
  });

  return str;
}

function decode(str) {
  const alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
  let i = 0;

  function parse() {
    let nr = 0, 
      hasMoreDigits;
    do {
      const index = alphabet.indexOf(str.charAt(i++));
      nr = nr << 5 | index & 31; // 0b011111 payload
      hasMoreDigits = index & 32; // 0b100000 flag
    } while (hasMoreDigits);

    return nr; // int32 due to the bit operations above
  }

  let data = Array(parse());
  for (let j = 0; j < data.length; ++j) {
    let arr = data[j] = Array(parse());
    for (let k = 0; k < arr.length; ++k) {
      arr[k] = parse();
    }
  }
  return data;
}

let data = [
  [1, 2, 9, 3],
  [1, 0, 2],
  [39, 4]
];

let text = encode(data);
let data2 = decode(text);

console.log("input:", data);
console.log("encoded:", text, "length:", text.length);
console.log("output:", data2);
console.log("equal:", JSON.stringify(data) === JSON.stringify(data2));
.as-console-wrapper{top:0;max-height:100%!important}

The encoding of the numbers. Ideally you would encode a number as binary with a static size, but this means 32bit/int which would be 6characters/number, so multibytes.

We split the number into chunks of 'n' bits, ignore the leading zeroes and encode the rest. ideally we can encode small number with very few characters, downside: we loose 1bit/chunk if n is too small and the average numbers are big. It's a tradeoff; that's why I left this configurable.

the current format is 6bits/number. 1 for the Structure, 5 bits as payload. In the format (1.....)*0.....

Thomas
  • 11,958
  • 1
  • 14
  • 23
  • imma have to check this out, seems to be better than the original post, though I have not tested it yet so not sure. either way, Thanks a bunch and I will look deeper into your code in a short while :) – Siddharth Agrawal Aug 11 '19 at 13:21
  • Yes, this is definitely more effective, thanks a lot, though I like both the answers as yours produces smaller URL, and the one from Jon Trent allows me to also include objects in the future if need be, though for now, I will definitely use your code as it produces smaller URLs, Thank you :) – Siddharth Agrawal Aug 11 '19 at 13:50
  • @Siddarth, I will take no offense if you wish to switch the accepted answer to Thomas. Bit twiddling (ie, converting to base2) will almost always provide better character packing! – Trentium Aug 11 '19 at 14:10
  • @JonTrent actually I use base32 numbers in the code. The better compression comes from the facts that I don't need to encode the structure, it is hardcoded in the encoder, that I can encode 0..31 with a single character, not just 0..9, that I can encode `[]` with a single caracter representing the length of the Array and that I "encode" the `,` as a single bit on the number; namely when the character is from the first half of the b64 alphabet, then this is the end of a number, if it is from the other half, then there are more digits to follow. – Thomas Aug 11 '19 at 15:07
  • @SiddharthAgrawal yes this is a very specific implementation tailored specifically to the assumption that the structure of your data is `Array>` with mostly small positive integers. This **is not** and was never intended to be, a general purpose implementation. So yes, it produces smaller results **for this specific format**. – Thomas Aug 11 '19 at 15:10
  • U both are geniuses lol, couldn’t have figured this out on my own thnx a bunch guys! I pretty much don’t even quite get your code Thomas lol. – Siddharth Agrawal Aug 11 '19 at 18:19
  • @SiddharthAgrawal sry, I didn't have time to address your comment earlier. I've replaced the code with something hopefully more understandable. It works exactly as the old one (generates the exact same output). It's simpler because it doesn't have to deal with potentially different length of bits per number anymore; I've hardcoded the bit-mask and the shifts, etc. Hope this helps you understanding my approach. – Thomas Aug 17 '19 at 03:13
0

I would suggest you directly encode the Base13 string into Base64. Although that might not result in better compression than your solution, it removes the heavy multiplications you are performing. Moreover how do you guarantee that no collisions happen when converting through arrayToDecimal ?

Valentin
  • 400
  • 1
  • 4
  • 14