5

I've got a javascript application that sends a large amount of numerical data down the wire. This data is then stored in a database. I am having size issues (too much bandwidth, database getting too big). I am now ready to sacrifice some performance for compression.

I was thinking of implementing a base 62 number.toString(62) and parseInt(compressed, 62). This would certainly reduce the size of the data but before I go ahead and do this I thought I would put it to the folks here as I know there must be some outside the box solution I have not considered.

The basic specs are: - Compress large number arrays into strings for JSONP transfer (So I think UTF is out) - Be relatively fast, look I'm not expecting same performance as I have now but I also don't want gzip compression either.

Any ideas would be greatly appreciated.

Thanks

Guido Tapia

Justin Johnson
  • 30,978
  • 7
  • 65
  • 89
gatapia
  • 3,574
  • 4
  • 40
  • 48
  • Why not gzip? What's wrong with that? (or more likely, DEFLATE) You can also do something like just Huffman, or just LZW, if you want to keep it simpler. Huffman coding in Javascript: http://tom-ash.net/blogs/Blog.aspx?File=Programming/20090602_HuffmanCompression.blog – Cheeso Apr 07 '10 at 23:05
  • Just to be clear - were sending data from the client to the server via javascript and ajax/jsonp ? – James Westgate Apr 07 '10 at 23:11
  • @James: Correct we are sending a very large number array as a string to a server. The array is not in JSON (so no silly xml tags - [edit] oops I meant curly braces) just a string.join('.') between the ints. This string is then also stored in a database (which is the real size issue). – gatapia Apr 07 '10 at 23:23
  • @Cheeso:I don't want to put the compression / decompression load on the server. I want the client (browser) to incur this cost so gzip is out unless u can think of a way of a way of using a complex compression algorithm in JS. – gatapia Apr 07 '10 at 23:27
  • I was thinking you would do Huffman or LZ77 or LZW of deflate in Javascript, in the browser. If you compress the data in the browser, there's no way to avoid load on the server, of course. You will need to decompress on the server, after uploading. – Cheeso Apr 08 '10 at 00:45
  • @Cheeso, server does not need data, it justs stores it in the database for future delivery back to a web browser. – gatapia Apr 08 '10 at 03:01
  • @gatapia, just out of interest, what sizes are we talking about here? What's an average sized transmission? How large is the DB? – nickf Apr 08 '10 at 04:20
  • Ah, I see. The server never needs to decompress in that case. – Cheeso Apr 08 '10 at 05:25
  • It makes me wonder if you are creating some kind of map / reduce grid application in javascript ... – James Westgate Apr 08 '10 at 17:40
  • 1
    @nick: We are talking requests of ~200k each. And we are talking ~100,000 requests a day and we keep 2 weeks of data, however that 100,000 requests / day are increasing :) If I can get those requests to 100k it will let me live with my current hardware set up for at least another 6 months. – gatapia Apr 08 '10 at 21:25
  • What's the nature of the data? Does it need to keep order? There are usually better ways to compress strings of numbers than LZW for common cases. – alecco Apr 11 '10 at 19:56
  • Unfortunately there's no Javascript binding for this yet, but [Quantile Compression](https://www.graphallthethings.com/posts/quantile-compression) would be perfect your your use case. It's available in Rust, compresses and decompresses very quickly, and achieves better compression ratio than any other lossless algorithms on numerical data. – mwlon Sep 18 '21 at 15:49

3 Answers3

5

Another way of doing this might be to encode to binary types such as signed/unsigned ints, and manually decode as at http://snippets.dzone.com/posts/show/685 which would require server side code to create the binary data.

You could then huffman compression or something similar like RLE (see http://rosettacode.org/wiki/Run-length_encoding#JavaScript for an implementation, though it may have some issues in IE without modifying) to compress the data further.

EDIT: Alternatively, you could convert the numbers themselves to a base (radix) in the unencoded URI character range (see http://en.wikipedia.org/wiki/Percent-encoding) which should work well if many of the numbers are larger than 2 digits. I converted the code at http://code.activestate.com/recipes/111286-numeric-base-converter-that-accepts-arbitrary-digi/ from python to do this.

It currently doesn't handle floats, but it could be done fairly easily:

function get_map(s) {
    d = {}
    for (var i=0; i<s.length; i++) {
        d[s.charAt(i)] = i}
    d.length = s.length
    d._s = s
    return d}

var separate_with = '~';
var encodable = get_map('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.'); // - is reserved for negatives obviously :-P
var base10 = get_map('0123456789')

// UNCOMMENT ME for length/speed testing in a wider base!
// You may wish to experiment with the ranges for a happy medium between bandwidth and DB space :-P
/*var encodable = ''
for (var i=1; i<128; i++) {
    encodable += String.fromCharCode(i)
}
encodable = get_map(encodable)*/

function baseconvert(number, fromdigits, todigits) {
    var number = String(number)

    if (number.charAt(0) == '-') {
        number = number.slice(1, number.length)
        neg=1}
    else {
        neg=0}

    // make an integer out of the number
    var x = 0
    for (var i=0; i<number.length; i++) {
        var digit = number.charAt(i)
        x = x*fromdigits.length + fromdigits[digit]
    }

    // create the result in base 'todigits.length'
    res = ""
    while (x>0) {
        remainder = x % todigits.length
        res = todigits._s.charAt(remainder) + res
        x = parseInt(x/todigits.length)
    }

    if (neg) res = "-"+res
    return res
}

function encodeNums(L) {
    var r = []
    for (var i=0; i<L.length; i++) {
         r.push(baseconvert(L[i], base10, encodable))
    }
    return r.join(separate_with)
}

function decodeNums(s) {
    var r = []
    var s = s.split(separate_with)
    for (var i=0; i<s.length; i++) {
         r.push(parseInt(baseconvert(s[i], encodable, base10)))
    }
    return r
}

var test = [5, 654645, 24324, 652124, 65, 65289543, 65278432, 643175874158, 652754327543]
alert(encodeNums(test))
alert(decodeNums(encodeNums(test)))
cryo
  • 14,219
  • 4
  • 32
  • 35
  • This is what I proposed in my initial question (base 62 encoding). Your example does however raise the potential for a larger radix (128) however I doubt that that will JSONP transfer correctly, I think 62 is about as good as it will get. Thanks for the code tho, I'll compare to mine for performance. – gatapia Apr 08 '10 at 21:34
  • FYI, This code would create a bunch of global variables. You may want to update it. In get_map: d would be a global variable, in baseconvert: res, neg, and remainder will all create global variables. I stopped reading after that, but the idea seems intersting... – Charlie Feb 28 '16 at 03:27
0

Options

  • Use a js library (see answer from Josh) but watch for script timeouts
  • Use some kind of activex or silverlight uploader that also does zip
  • Use a java plug-in
  • Use a flash based compressing uploader
James Westgate
  • 11,306
  • 8
  • 61
  • 68
0

I'm now toying with the idea of encoding the length of the number into the number itself. I still have not perfected this algorithm but will post it once done. But roughly this is what I am currently trying to achieve:

Boundaries:

  • Loss of precision is allowed (+- 3)
  • Max number will be 3200
  • Min is 0 (so no negatives)
  • No decimal - floats

So now given my max allowed number I know that the length of the encoded digit in base 62 will have a max length of 2. So any encoded number is either 1 or 2 characters in length. Awesome. So now I'm going to make the number odd or even depending if its 1 or 2 characters (remember I can handle loss of precission). This removes the need for separators.

Now I'm seeing about 70%-80% compression with this, its very buggy at the moment but I'm excited about it, so the post to encourage discussion around this methodology.

gatapia
  • 3,574
  • 4
  • 40
  • 48
  • If binary was allowed, I'd use the first of the eight bits in each char for asking whether the number continues as at http://en.wikipedia.org/wiki/Variable-length_code. But reserving 31 chars for "continue" and 31 for "stop" would probably also work. Then again, you've probably already thought of that by the sound of it :-) – cryo Apr 08 '10 at 23:47