33

I have a bunch of 10 digit integers that I'm passing in a URL. Something like: "4294965286", "2292964213". They will always be positive and always be 10 digits.

I'd like to compress those integers into the smallest possible form that can still be used in in a URL (aka letters and numbers are perfectly fine) and then uncompress them later. I've looked at using gzipstream but it creates larger strings, not shorter.

I'm currently using asp.net so a vb.net or c# solution would be best.

Thanks

Paul Lemke
  • 5,494
  • 3
  • 47
  • 66

6 Answers6

36

Yes. GZIP is a compression algorithm which both requires compressible data and has an overhead (framing and dictionaries, etc). An encoding algorithm should be used instead.

The "simple" method is to use base-64 encoding.

That is, convert the number (which is represented as base 10 in the string) to the actual series of bytes that represent the number (5 bytes will cover a 10 digit decimal number) and then base-64 that result. Each base-64 character stores 6 bits of information (to the decimals ~3.3 bits/character) and will thus result in a size of approximately just over half (in this case, 6* base-64 output characters are required).

Additionally, since the input/output lengths are obtainable from the data itself, "123" might be originally (before being base-64 encoded) converted as 1 byte, "30000" as 2 bytes, etc. This would be advantageous if not all the numbers are approximately the same length.

Happy coding.


* Using base-64 requires 6 output characters.

Edit: I was wrong initially where I said "2.3 bits/char" for decimal and proposed that less than half the characters were required. I have updated the answer above and show the (should be correct) math here, where lg(n) is log to the base 2.

The number of input bits required to represent the input number is bits/char * chars -> lg(10) * 10 (or just lg(9999999999)) -> ~33.2 bits. Using jball's manipulation to shift the number first, the number of bits required is lg(8999999999) -> ~33.06 bits. However this transformation isn't able to increase the efficiency in this particular case (the number of input bits would need to be reduced to 30 or below to make a difference here).

So we try to find an x (number of characters in base-64 encoding) such that:

lg(64) * x = 33.2 -> 6 * x = 33.2 -> x ~ 5.53. Of course five and a half characters is nonsensical so we choose 6 as the maximum number of characters required to encode a value up to 999999999 in base-64 encoding. This is slightly more than half of the original 10 characters.

However, it should be noted that to obtain only 6 characters in base-64 output requires a non-standard base-64 encoder or a little bit of manipulation (most base-64 encoders only work on whole bytes). This works because out of the original 5 "required bytes" only 34 of the 40 bits are used (the top 6 bits are always 0). It would require 7 base-64 characters to encode all 40 bits.

Here is a modification of the code that Guffa posted in his answer (if you like it, go give him an up-vote) that only requires 6 base-64 characters. Please see other notes in Guffa's answer and Base64 for URL applications as the method below does not use a URL-friendly mapping.

byte[] data = BitConverter.GetBytes(value);
// make data big-endian if needed
if (BitConverter.IsLittleEndian) {
   Array.Reverse(data);
}
// first 5 base-64 character always "A" (as first 30 bits always zero)
// only need to keep the 6 characters (36 bits) at the end 
string base64 = Convert.ToBase64String(data, 0, 8).Substring(5,6);

byte[] data2 = new byte[8];
// add back in all the characters removed during encoding
Convert.FromBase64String("AAAAA" + base64 + "=").CopyTo(data2, 0);
// reverse again from big to little-endian
if (BitConverter.IsLittleEndian) {
   Array.Reverse(data2);
}
long decoded = BitConverter.ToInt64(data2, 0);

Making it "prettier"

Since base-64 has been determined to use 6 characters then any encoding variant that still encodes the input bits into 6 characters will create just as small an output. Using a base-32 encoding won't quite make the cut, as in base-32 encoding 6 character can only store 30 bits of information (lg(32) * 6).

However, the same output size could be achieved with a custom base-48 (or 52/62) encoding. (The advantage of a base 48-62 is that they only requires a subset of alpha-numeric characters and do not need symbols; optionally "ambiguous" symbols like 1 and "I" can be avoided for variants). With a base-48 system the 6 characters can encode ~33.5 bits (lg(48) * 6) of information which is just above the ~33.2 (or ~33.06) bits (lg(10) * 10) required.

Here is a proof-of-concept:

// This does not "pad" values
string Encode(long inp, IEnumerable<char> map) {
    Debug.Assert(inp >= 0, "not implemented for negative numbers");

    var b = map.Count();
    // value -> character
    var toChar = map.Select((v, i) => new {Value = v, Index = i}).ToDictionary(i => i.Index, i => i.Value);
    var res = "";
    if (inp == 0) {
      return "" + toChar[0];
    }
    while (inp > 0) {
      // encoded least-to-most significant
      var val = (int)(inp % b);
      inp = inp / b;
      res += toChar[val];
    }
    return res;
}

long Decode(string encoded, IEnumerable<char> map) {
    var b = map.Count();
    // character -> value
    var toVal = map.Select((v, i) => new {Value = v, Index = i}).ToDictionary(i => i.Value, i => i.Index);      
    long res = 0;
    // go in reverse to mirror encoding
    for (var i = encoded.Length - 1; i >= 0; i--) {
      var ch = encoded[i];
      var val = toVal[ch];
      res = (res * b) + val;
    }
    return res;
}

void Main()
{
    // for a 48-bit base, omits l/L, 1, i/I, o/O, 0
    var map = new char [] {
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K',
        'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W',
        'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
        'h', 'j', 'k', 'm', 'n', 'p', 'q', 'r', 's', 't',
        'u', 'v', 'x', 'y', 'z', '2', '3', '4',
    };
    var test = new long[] {0, 1, 9999999999, 4294965286, 2292964213, 1000000000};
    foreach (var t in test) {
        var encoded = Encode(t, map);
        var decoded = Decode(encoded, map);
        Console.WriteLine(string.Format("value: {0} encoded: {1}", t, encoded));
        if (t != decoded) {
            throw new Exception("failed for " + t);
        }
    }
}

The result is:

value: 0 encoded: A
value: 1 encoded: B
value: 9999999999 encoded: SrYsNt
value: 4294965286 encoded: ZNGEvT
value: 2292964213 encoded: rHd24J
value: 1000000000 encoded: TrNVzD

The above considers the case where the numbers are "random and opaque"; that is, there is nothing that can be determined about the internals of the number. However, if there is a defined structure (e.g. 7th, 8th, and 9th bits are always zero and 2nd and 15th bits are always the same) then -- if and only if 4 or more bits of information can be eliminated from the input -- only 5 base-64 characters would be required. The added complexities and reliance upon the structure very likely outweigh any marginal gain.

Aristos
  • 66,005
  • 16
  • 114
  • 150
  • 1
    A ten digit integer like 5432154321 can't be represented as four bytes. – Guffa May 05 '11 at 16:54
  • @Guffa Yeah, confused between input shown and requirements. Updated. –  May 05 '11 at 16:56
  • In Encode function if I do inp = inp / b; two times then, what will be the changes in decode function? – waghekapil Dec 30 '14 at 08:28
  • Why are you generating a dictionary in encode? You could just access the array with the index as int directly? I prefered most-to-least format so I will post an answer that is opitmized and most-to-least – Michael P May 15 '20 at 12:04
9

You could use base64 encoding to reduce the data into seven characters. You need five bytes to represent the number, and those can be encoded into eight characters using base64, but that last character is always the filler =, so it can be removed:

long value = 4294965286;

// get the value as an eight byte array (where the last three are zero)
byte[] data = BitConverter.GetBytes(value);
// encode the first five bytes
string base64 = Convert.ToBase64String(data, 0, 5).Substring(0, 7);
Console.WriteLine(base64);

Output:

Jvj//wA

To decode the text, you add the = again, decode it, and read it as a number:

// create an eight byte array
byte[] data = new byte[8];
// decode the text info five bytes and put in the array
Convert.FromBase64String(base64 + "=").CopyTo(data, 0);
// get the value from the array
long value = BitConverter.ToInt64(data, 0);

Console.WriteLine(value);

Output:

4294965286

Two of the characters that base64 uses are not suitable for use in an URL, so you can replace them with other characters, and then replace them back. The + and / characters could for example be replaced by - and _.

Guffa
  • 687,336
  • 108
  • 737
  • 1,005
8

I think what you're looking for are Hash IDs: http://hashids.org/

They have implementations in many languages, although it looks like C# is not one of them.

I made an example for you in JavaScript: http://codepen.io/codycraven/pen/MbWwQm

var hashids = new Hashids('my salt', 1, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890');
var input = 4294965286;
var hex = input.toString(16); // 8 characters: fffff826
var hashid = hashids.encode(input); // 7 characters: 0LzaR1Y
var base64 = window.btoa(input).replace(/=+/, ''); // 14 characters: NDI5NDk2NTI4Ng

Note that the HashIDs libraries protect your hashes from including foul language.

Cody Craven
  • 676
  • 1
  • 6
  • 12
  • Thank you. I hand rolled some version of this using a "call-centre friendly" set of characters, only to find I can just use this off the shelf. Coolio. – Igor K Feb 19 '20 at 18:23
4

In addition to changing the base of the encoding (pst and I had the same thought around the same time), since all your numbers are 10 decimal digits, you can subtract the smallest 10 digit number (10E9) from each number before you encode it, and then add that back in after decoding. This will shift your encoded numbers into the range of 0 - 8999999999, thus allowing for smaller strings after the base change.

Community
  • 1
  • 1
jball
  • 24,791
  • 9
  • 70
  • 92
4

What about converting a big number to a formula: So instead of 21312312312 I might use 4^34. Link

Glorfindel
  • 21,988
  • 13
  • 81
  • 109
2

I liked @user166390 answer but I preferred a most-to-least format and thought the code could be improved since the use of dictionary is unnecessary in encode and don't need to be generated on every decode. Also I added an exception and changed to ulong since negative values are not supported.

If somebody has further performance improvements feel free to write. Maybe if there is a better alternative to StringBuilder

Here is the code modified by me.

        public static string EncodeNumber(ulong input)
        {
            return EncodeNumber(input, Mapping85Bit);
        }

        // This does not "pad" values
        private static string EncodeNumber(ulong inp, char[] map)
        {
            // use ulong count instead of int since does not matter on x64 operating system.
            ulong cnt = (ulong)map.Length;
            // value -> character
            if (inp == 0)
            {
                return map[0].ToString();
            }
            var sb = new StringBuilder();
            while (inp > 0)
            {
                // encoded most-to-least significant
                ulong val = inp % cnt;
                inp = inp / cnt;
                sb.Insert(0, map[(int)val]);
            }
            return sb.ToString();
        }

        public static ulong DecodeNumber(string encoded)
        {
            return DecodeNumber(encoded, Mapping85Bit, Mapping85BitDict);
        }

        private static ulong DecodeNumber(string encoded, char[] map, Dictionary<char, ulong> charMapDict)
        {
            // use ulong count instead of int since does not matter on x64 operating system.
            ulong b = (ulong)map.Length;
            ulong res = 0;
            for (var i = 0; i < encoded.Length; i++)
            {
                char ch = encoded[i];
                if(!charMapDict.TryGetValue(ch, out ulong val))
                {
                    throw new ArgumentException($"Invalid encoded number: '{encoded}'. '{ch}' is not a valid character for this encoding.");
                }
                res = (res * b) + val;
            }
            return res;
        }



        // Windows file system reserved characters:     < > : " / \ | = * 

        /// <summary>
        /// Compatible with file system. Originates from ASCII table except starting like Base64Url and except windows path reserved chars. Skipped '/' and '\' to prevent path problems. Skipped ' for sql problems.
        /// https://www.ascii-code.com/
        /// Does not need to be encoded for json since it doesn't use \ and ". No encoding also needed for xml since &lt; &gt; are also not used. That is why it is also different to https://en.wikipedia.org/wiki/Ascii85
        /// </summary>
        public static readonly char[] Mapping85Bit = new char[] {
            'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
            'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T',
            'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd',
            'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
            'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
            'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7',
            '8', '9', '-', '_', ' ', '!', '#', '$', '%', '&',
            '(', ')', '+', ',', '.', ';', '?', '@', '[', ']',
            '^', '`', '{', '}', '~'
        };
        private static readonly Dictionary<char, ulong> Mapping85BitDict = Mapping85Bit.Select((v, i) => new { Value = v, Index = (ulong)i }).ToDictionary(i => i.Value, i => i.Index);

    [Test]
    public void EncodeTest()
    {
        // 85Bit Encoding:
        Assert.AreEqual(EncodeNumber(85), "BA");
        Assert.AreEqual(EncodeNumber(86), "BB");
        Assert.AreEqual(EncodeNumber(3), "D");
        Assert.AreEqual(EncodeNumber(84), "~");

        Assert.AreEqual(EncodeNumber(0), "A");

        Assert.AreEqual(DecodeNumber("BA"), 85);

        Assert.AreEqual(DecodeNumber("BA"), 85);
        Assert.AreEqual(DecodeNumber("`"), 81);
    }
Michael P
  • 241
  • 3
  • 8