64

I'm looking for a JavaScript function that given a string returns a compressed (shorter) string.

I'm developing a Chrome web application that saves long strings (HTML) to a local database. For testing purposes I tried to zip the file storing the database, and it shrank by a factor of five, so I figured it would help keep the database smaller if I compressed the things it stores.

I've found an implementation of LZSS in JavaScript here: http://code.google.com/p/u-lzss/ ("U-LZSS").

It seemed to work when I tested it "by hand" with short example strings (decode === encode), and it's reasonably fast in Chrome. But when given big strings (100 KB) it seems to garble/mix up the last half of the string.

Is it possible that U-LZSS expects short strings and can't deal with larger strings? And would it be possible to adjust some parameters in order to move that upper limit?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Bambax
  • 2,920
  • 6
  • 34
  • 43
  • 2
    Apart from size, are there any other differences between your test cases and your actual data, like encoding, for instance? `u-lzss` seems to only work with UTF-8-encoded strings. – Frédéric Hamidi Dec 31 '10 at 13:29
  • 2
    If that U-LZSS cannot handle long strings it’s simply buggy and incorrect and shouldn’t be used. – Gumbo Dec 31 '10 at 13:30
  • 1
    This seems related - I wouldn't say duplicate, but close enough to do what you need: http://stackoverflow.com/questions/294297/javascript-implementation-of-gzip – Piskvor left the building Dec 31 '10 at 13:37
  • 1
    Apparently, the original author has some problem with putting comments [in the source](http://code.google.com/p/u-lzss/source/browse/trunk/js/lib/ulzss.js?r=18). *sigh* Compression is one of those places where the code can be pretty opaque without a hint as to intent. – T.J. Crowder Dec 31 '10 at 13:57
  • @Piskvor: you're right, it's a very close question; I don't know how I didn't find it before (I really tried!); I will look into the leads there and report here (some time next year... ;-) – Bambax Dec 31 '10 at 16:23
  • @Frédéric Hamidi: I wondered about that, yes, but I don't know how to test it? When one types in the console it's all UTF-8, right? I don't know what happens exactly if I copy a non-UTF-8 string and paste it in the console... it doesn't appear broken when I do... And the pages I'm storing (the actual data) are UTF-8 encoded (are at least supposed to be: they are served as such). – Bambax Dec 31 '10 at 16:28

9 Answers

52

I just released a small LZW implementation especially tailored for this very purpose, as none of the existing implementations met my needs.

That's what I'm using going forward, and I will probably try to improve the library at some point.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
pieroxy
  • 999
  • 7
  • 15
  • Is there a PHP library compatible with your LZW implementation? The one I downloaded returns an empty string for data generated in JS. – Tomáš Zato Nov 24 '13 at 23:21
  • I'm not aware of any php implementation. There is a small section on the home page to help you port the lib if you need to: [Porting LZString to another language](http://pieroxy.net/blog/pages/lz-string/index.html#inline_menu_6) – pieroxy Nov 27 '13 at 11:11
  • 1
    Thanks pieroxy, I tried your library but it's not efficient with short strings. Eg. I compress a 250 bytes string, I obtain a 300 bytes output. Any way to deal with this? – davide Dec 16 '14 at 22:58
  • @davide That's strange because this lib is specially tailored for short strings. If I copy/paste your comment for example (200 chars so 400 bytes) and compress it with [the demo page](http://pieroxy.net/blog/pages/lz-string/demo.html) it compresses it to 188 bytes, which is less than 50%! If you keep having trouble, just open an issue on GitHub - it is more suited for this kind of discussion than here. – pieroxy Jan 07 '15 at 06:44
  • 1
    Worked perfectly for what I needed using `compressToBase64()` and `decompressFromBase64()`. – TripeHound Nov 17 '15 at 13:23
  • This is really nice. Using it to compress/decompress fairly large JSON stringified objects into query string params, working great. Thanks for your work! – reads0520 Aug 26 '21 at 16:28
23

It seems there is a proposal for a compression/decompression API: https://github.com/wicg/compression/blob/master/explainer.md .

And it is implemented in Chrome 80 (right now in Beta) according to a blog post at https://blog.chromium.org/2019/12/chrome-80-content-indexing-es-modules.html .

I am not sure I am doing a good conversion between streams and strings, but here is my attempt at using the new API:

function compress(string, encoding) {
  // Turn the string into UTF-8 bytes, pipe them through the
  // compression stream, and collect the result as an ArrayBuffer.
  const byteArray = new TextEncoder().encode(string);
  const cs = new CompressionStream(encoding);
  const writer = cs.writable.getWriter();
  writer.write(byteArray);
  writer.close();
  return new Response(cs.readable).arrayBuffer();
}

function decompress(byteArray, encoding) {
  // The reverse: pipe the compressed bytes through the decompression
  // stream and decode the resulting ArrayBuffer back into a string.
  const cs = new DecompressionStream(encoding);
  const writer = cs.writable.getWriter();
  writer.write(byteArray);
  writer.close();
  return new Response(cs.readable).arrayBuffer().then(function (arrayBuffer) {
    return new TextDecoder().decode(arrayBuffer);
  });
}

const test = "http://www.ScriptCompress.com - Simple Packer/Minify/Compress JavaScript Minify, Fixify & Prettify 75 JS Obfuscators In 1 App 25 JS Compressors (Gzip, Bzip, LZMA, etc) PHP, HTML & JS Packers In 1 App PHP Source Code Packers Text Packer HTML Packer or v2 or v3 or LZW Twitter Compress or More Words DNA & Base64 Packer (freq tool) or v2 JS JavaScript Code Golfer Encode Between Quotes Decode Almost Anything Password Protect Scripts HTML Minifier v2 or Encoder or Escaper CSS Minifier or Compressor v2 SVG Image Shrinker HTML To: SVG or SVGZ (Gzipped) HTML To: PNG or v2 2015 JS Packer v2 v3 Embedded File Generator Extreme Packer or version 2 Our Blog DemoScene JS Packer Basic JS Packer or New Version Asciify JavaScript Escape JavaScript Characters UnPacker Packed JS JavaScript Minify/Uglify Text Splitter/Chunker Twitter, Use More Characters Base64 Drag 'n Drop Redirect URL DataURI Get Words Repeated LZMA Archiver ZIP Read/Extract/Make BEAUTIFIER & CODE FIXER WHAK-A-SCRIPT JAVASCRIPT MANGLER 30 STRING ENCODERS CONVERTERS, ENCRYPTION & ENCODERS 43 Byte 1px GIF Generator Steganography PNG Generator WEB APPS VIA DATAURL OLD VERSION OF WHAK PAKr Fun Text Encrypt Our Google";

async function testCompression(text, encoding = 'deflate') {
  console.log(encoding + ':');
  console.time('compress');
  const compressedData = await compress(text, encoding);
  console.timeEnd('compress');
  console.log('compressed length:', compressedData.byteLength, 'bytes');
  console.time('decompress');
  const decompressedText = await decompress(compressedData, encoding);
  console.timeEnd('decompress');
  console.log('decompressed length:', decompressedText.length, 'characters');
  console.assert(text === decompressedText);
}

(async function () {
  await testCompression(test, 'deflate');
  await testCompression(test, 'gzip');
}());

document.getElementById('go').onclick = function () {
  const s = document.getElementById('string').value;
  testCompression(s, 'gzip');
};
<div>
<label>
String to compress:
<input id="string" />
</label>
</div>
<button id="go">Go</button>
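If you need the compressed output as a plain string rather than an ArrayBuffer (as discussed in the comments below), one common approach is base64. A minimal sketch; the helper names here are mine, not part of the API:

```javascript
// Convert an ArrayBuffer (e.g. the result of compress()) to a base64 string.
function arrayBufferToBase64(buffer) {
  const bytes = new Uint8Array(buffer);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary); // btoa is available in browsers and in modern Node
}

// And back again, producing bytes suitable for decompress().
function base64ToArrayBuffer(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes.buffer;
}
```

Base64 inflates the data by about a third, so only do this at the storage boundary, not as the in-memory representation.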
4esn0k
  • 9,789
  • 7
  • 33
  • 40
  • 1
    Great answer! One little nitpick for clarity: the parameter to the CompressionStream/DecompressionStream constructor you named 'encoding' but it's [the compression format](https://developer.mozilla.org/en-US/docs/Web/API/CompressionStream/CompressionStream) - 'encoding' would refer to the UTF8 encoding you're doing with the TextEncoder/TextDecoder. – Arbiter Jun 21 '21 at 17:10
  • @4esn0k thanks for the answer! However, I'm struggling to output the compressed data to a string, i.e. for 'hello world' I should get 'H4sIAAAAAAAACstIzcnJVyjPL8pJAQCFEUoNCwAAAA=='. How do I do that? – Presian Nedyalkov Sep 30 '21 at 16:21
  • 1
    @PresianNedyalkov you need to use a function from https://stackoverflow.com/a/9458996/839199 - _arrayBufferToBase64(compressedData); – 4esn0k Sep 30 '21 at 19:13
  • 1
    Thank you @4esn0k, that worked great! – Presian Nedyalkov Oct 01 '21 at 21:26
  • Thank you. And what about `brotli`? – Mir-Ismaili Apr 02 '22 at 19:17
21

Here are encode (276 bytes, function en) and decode (191 bytes, function de) functions I modded from LZW in a fully working demo. There is no smaller or faster routine available on the internet than what I am giving you here.

function en(c){var x='charCodeAt',b,e={},f=c.split(""),d=[],a=f[0],g=256;for(b=1;b<f.length;b++)c=f[b],null!=e[a+c]?a+=c:(d.push(1<a.length?e[a]:a[x](0)),e[a+c]=g,g++,a=c);d.push(1<a.length?e[a]:a[x](0));for(b=0;b<d.length;b++)d[b]=String.fromCharCode(d[b]);return d.join("")}

function de(b){var a,f,o,e={},d=b.split(""),c=f=d[0],g=[c],h=o=256;for(b=1;b<d.length;b++)a=d[b].charCodeAt(0),a=h>a?d[b]:e[a]?e[a]:f+c,g.push(a),c=a.charAt(0),e[o]=f+c,o++,f=a;return g.join("")}

var compressed=en("http://www.ScriptCompress.com - Simple Packer/Minify/Compress JavaScript Minify, Fixify & Prettify 75 JS Obfuscators In 1 App 25 JS Compressors (Gzip, Bzip, LZMA, etc) PHP, HTML & JS Packers In 1 App PHP Source Code Packers Text Packer HTML Packer or v2 or v3 or LZW Twitter Compress or More Words DNA & Base64 Packer (freq tool) or v2 JS JavaScript Code Golfer Encode Between Quotes Decode Almost Anything Password Protect Scripts HTML Minifier v2 or Encoder or Escaper CSS Minifier or Compressor v2 SVG Image Shrinker HTML To: SVG or SVGZ (Gzipped) HTML To: PNG or v2 2015 JS Packer v2 v3 Embedded File Generator Extreme Packer or version 2 Our Blog DemoScene JS Packer Basic JS Packer or New Version Asciify JavaScript Escape JavaScript Characters UnPacker Packed JS JavaScript Minify/Uglify Text Splitter/Chunker Twitter, Use More Characters Base64 Drag 'n Drop Redirect URL DataURI Get Words Repeated LZMA Archiver ZIP Read/Extract/Make BEAUTIFIER & CODE FIXER WHAK-A-SCRIPT JAVASCRIPT MANGLER 30 STRING ENCODERS CONVERTERS, ENCRYPTION & ENCODERS 43 Byte 1px GIF Generator Steganography PNG Generator WEB APPS VIA DATAURL OLD VERSION OF WHAK PAKr Fun Text Encrypt Our Google");
var decompressed=de(compressed);

document.writeln('<hr>'+compressed+'<hr><h1>'+compressed.length+' characters versus original '+decompressed.length+' characters.</h1><hr>'+decompressed+'<hr>');
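For readers who find the minified routines hard to follow, here is an unminified sketch of the same LZW scheme. The names `lzwEncode`/`lzwDecode` are mine; the logic mirrors `en`/`de` above, with the dictionary created via `Object.create(null)` to avoid prototype-key collisions:

```javascript
// Readable LZW encoder: emits one UTF-16 character per code.
function lzwEncode(input) {
  const dict = Object.create(null); // phrase -> code
  let nextCode = 256;               // 0..255 are reserved for single chars
  let phrase = input[0];
  const out = [];
  for (let i = 1; i < input.length; i++) {
    const ch = input[i];
    if (dict[phrase + ch] !== undefined) {
      phrase += ch;                 // extend the current phrase
    } else {
      out.push(phrase.length > 1 ? dict[phrase] : phrase.charCodeAt(0));
      dict[phrase + ch] = nextCode++;
      phrase = ch;
    }
  }
  out.push(phrase.length > 1 ? dict[phrase] : phrase.charCodeAt(0));
  // Codes above 65535 would overflow String.fromCharCode, as in the original.
  return out.map(code => String.fromCharCode(code)).join('');
}

// Readable LZW decoder: rebuilds the dictionary while decoding.
function lzwDecode(input) {
  const dict = Object.create(null); // code -> phrase
  let nextCode = 256;
  let prev = input[0];
  const out = [prev];
  for (let i = 1; i < input.length; i++) {
    const code = input[i].charCodeAt(0);
    let entry;
    if (code < 256) {
      entry = input[i];             // literal character
    } else {
      // dict[code] may not exist yet: the classic "KwKwK" edge case.
      entry = dict[code] !== undefined ? dict[code] : prev + prev.charAt(0);
    }
    out.push(entry);
    dict[nextCode++] = prev + entry.charAt(0);
    prev = entry;
  }
  return out.join('');
}
```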
Dave Brown
  • 923
  • 9
  • 6
  • 1
    does not work - does not correctly recreate the input: http://jsfiddle.net/5gmv74b6/ – vlad_tepesch Mar 06 '20 at 12:04
  • 1
    Much more compressed version (en=264, de=179 bytes): https://gist.github.com/mr5z/d3b653ae9b82bb8c4c2501a06f3931c6 – mr5 Sep 11 '20 at 18:36
  • 1
    you should initialize vars f,o to avoid a ReferenceError: `at Object.de` `Uncaught ReferenceError: f is not defined` `Uncaught ReferenceError: o is not defined` – PartialFlavor_55KP Oct 10 '21 at 16:02
  • 3
    Your code fails with UTF-8 Characters. Just paste a Smiley on the input string, and it gets messed up after decompression. Your lack of support for those is likely the reason your compression ratio is better than mine. – pieroxy Feb 02 '22 at 15:36
8

At Piskvor's suggestion, I tested the code found in an answer to this question: JavaScript implementation of Gzip (top-voted answer: LZW implementation) and found that:

  1. it works
  2. it reduces the size of the database by a factor of two

... which is less than 5 but better than nothing! So I used that.

(I wish I could have accepted an answer by Piskvor but it was only a comment).

Community
  • 1
  • 1
Bambax
  • 2,920
  • 6
  • 34
  • 43
7

To me it doesn't seem reasonable to compress a string using UTF-8 as the destination... It looks like asking for trouble. I think it would be better to lose some compression and use plain 7-bit ASCII as the destination if over-the-wire size is important.

If the storage limit is based on UTF-16 characters, then you could look for a large safe subset if you care about escaping or UTF-16 compliance, or you could just use each char as a value 0..65535 if everything else involved (e.g. databases) has no problems with that. Most software layers should tolerate that (ab)use, but note that in UTF-16 the range 0xD800-0xDFFF is reserved for a special use (surrogate pairs), so some combinations are formally "encoding errors" and could in theory be stopped or distorted.

In a toy 4 KB JavaScript demo I wrote for fun I used an encoding for the result of compression that stores four binary bytes into five chars chosen from a subset of ASCII of 85 chars that is clean for embedding in a JavaScript string (85^5 is slightly more than (2^8)^4, but still fits in the precision of JavaScript integers). This makes compressed data safe for example for JSON without need of any escaping.

In code the following builds the list of 85 "safe" characters:

let cset = "";
for (let i=35; i<35+85+1; i++) {
    if (i !== 92) cset += String.fromCharCode(i);
}

Then to encode 4 bytes (b0, b1, b2 and b3 each from 0...255) into 5 characters the code is:

// First convert to 0...4294967295
let x = ((b0*256 + b1)*256 + b2)*256 + b3;

// Then convert to base 85
let result = "";
for (let i=0; i<5; i++) {
    let x2 = Math.floor(x / 85);
    result += cset[x - x2*85];
    x = x2;
}

To decode you do the reverse, i.e. compute x from the base-85 number and then extract the 4 base-256 digits (i.e. the bytes).
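That reverse step can be sketched as follows (self-contained; `encode4`/`decode5` are names I made up, built around the same 85-char `cset` as above):

```javascript
// Build the same 85-char set as above (chars 35..120, skipping 92 '\').
let cset = "";
for (let i = 35; i < 35 + 85 + 1; i++) {
  if (i !== 92) cset += String.fromCharCode(i);
}

// Encode 4 bytes (0..255 each) into 5 chars, least-significant digit first.
function encode4(b0, b1, b2, b3) {
  let x = ((b0 * 256 + b1) * 256 + b2) * 256 + b3; // 0..4294967295
  let result = "";
  for (let i = 0; i < 5; i++) {
    const x2 = Math.floor(x / 85);
    result += cset[x - x2 * 85]; // next base-85 digit
    x = x2;
  }
  return result;
}

// Decode 5 chars back into the 4 original bytes.
function decode5(s) {
  let x = 0;
  for (let i = 4; i >= 0; i--) {
    x = x * 85 + cset.indexOf(s[i]); // last char holds the most-significant digit
  }
  const b3 = x % 256; x = Math.floor(x / 256);
  const b2 = x % 256; x = Math.floor(x / 256);
  const b1 = x % 256; x = Math.floor(x / 256);
  return [x, b1, b2, b3];
}
```

Note that 85^5 stays below 2^53, so the intermediate value `x` is exact in JavaScript numbers.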

NOTE: in the torus code I used a slightly different charset: instead of skipping 92 (\), I replaced it with 126 (~). For anyone interested, the full decompression code is:

// There are two Huffman-encoded code streams
//    T - single chars (0..127) and sequence lengths (128...255)
//    A - high bits of relative addresses of sequence (0..255)
//
// Expansion algorithm is:
//    1) Read a code X from T
//    2) If it's a char (X < 128) then add to output
//    3) otherwise (X>=128) read sequence address ADDR from stream A (high bits)
//       and from input (low bits) and copy X-128 bytes from ADDR bytes "ago"
//

let Z = 5831; // expanded size
let i = 0, // source ptr
    a = 0, // current bits accumulator
    n = 0; // number of available bits in a

// Read a single bit
let b = function(){
    if (!n) {
        // There are no more bits available in the accumulator, read a new chunk:
        // 5 ASCII escape-safe chars will be transformed in 4 8-bit binary bytes
        // (like BASE64, just a bit more dense)
        a = 0;
        let w = 5;
        while (w--) {
            let y = s.charCodeAt(i+w);          // get next char
            a = a*85 + (y > 125 ? 92 : y) - 35; // extract base-85 "digit" (note, uses ~ instead of \ that requires quoting)
        }
        n = 32; // we got 32 bits in a
        i += 5; // we consumed 5 characters from source
    }
    return (a >> --n) & 1;  // extract a single bit
};

// Read a code of z bits by concatenating bits coming from b()
let v = function(z){
    return (--z ? v(z) : 0)*2+b();
};

// Read a Huffman (sub-)tree: a bit tells whether we need to
// read two sub-trees or a leaf
let h = function(){
    return b() ? [h(), h()] : v(8);
};

// Read A and T Huffman trees
let A = h(), T = h();

// Extract a code given a node:
//   if the node is an array (intermediate node) then we need to read a bit
//   from the input binary stream to decide which way to go down the tree,
//   if it's a number then we just return the value.
//   `n.map` is truthy for arrays and falsy for numbers.
let d = function(n){
    return n.map ? d(n[b()]) : n;
};

let S="";  // Output

// While we're not done
while(S.length<Z){
    // Extract a code from T
    let x = d(T);
    if (x < 128) {
        // This is a single character, copy to output
        S += String.fromCharCode(x);
    } else {
        // This is a sequence of x-128 bytes, get address and copy it
        // Note: high 8 bits are from the Huffman tree A and 8 low bits
        // are instead taken directly from the bit stream, as they're basically
        // noise and there's nothing to gain by trying to compress them.
        S += S.substr(S.length-(d(A)<<8)-v(8), x-128)
    };
}

(Note that I didn't test this reformatted/commented version; typos may be present.)

Mir-Ismaili
  • 13,974
  • 8
  • 82
  • 100
6502
  • 112,025
  • 15
  • 165
  • 265
  • That looks interesting. Minor nitpick: the function you're using *is* a form of escaping (or rather, escaping is a subset of encoding, and this is encoding all right) - it maps "potentially-problematic" characters to a set of 85 ASCII "probably-safe" characters. – Piskvor left the building Dec 31 '10 at 13:40
  • I'm not sure I understand what you mean. I've chosen to use the first 85 chars from 35 to 126 (skipping 92) so that the resulting compressed data can be simply wrapped in double quotes. Compressed data is almost random and if I for example don't skip 92 and just repeat it then the decoder shortens a bit because of the simplification but still the HTML size gets quite a bit bigger than 4096 bytes and is clearly totally unacceptable :-D ... to say it better I found that escaping compressed data is worse than choosing an encoding that doesn't need escaping. – 6502 Dec 31 '10 at 13:49
  • 2
    Your answer is all well and good, but in JavaScript there is no UTF-8 nor any 7-bit ASCII. Every string is internally encoded in UTF-16, and that's what all client-side databases will store. Note that this is not applicable to the size of a JavaScript file, but just applicable to the size in memory - or in localStorage - taken by a String object. – pieroxy May 11 '13 at 21:01
  • @pieroxy: I thought somehow that the database was using UTF-8 encoding for strings (more compact for ascii and supporting all unicode) so using UTF-16 as destination for compression would be wasting space. If instead the database is storing UTF-16 strings then that is of course the best target. Note that the fact that strings are UTF-16 when in memory in javascript is irrelevant. – 6502 May 12 '13 at 09:35
  • 3
    @6502 The limit in localStorage is defined in terms of characters, not in bytes. So whether it uses UTF-8 or UTF-16 doesn't really matter in the end. You can store 2.5M characters (5M on Firefox) and using the entire UTF-16 space still gives you more data. – pieroxy Jun 13 '13 at 07:47
  • Can you give an example of the method you mean? – dy_ Sep 24 '21 at 15:57
  • 1
    @dy_: I've added some javascript code – 6502 Sep 24 '21 at 18:03
1

I think you should also look into lz-string. It's fast, compresses quite well, and has some advantages they list on their page:

What about other libraries?

  • some LZW implementations which give you back arrays of numbers (terribly inefficient to store, as tokens take 64 bits) and don't support any character above 255.
  • some other LZW implementations which give you back a string (less terribly inefficient to store, but still, all tokens take 16 bits) and don't support any character above 255.
  • an LZMA implementation that is asynchronous and very slow - but hey, it's LZMA, not the implementation, that is slow.
  • a GZip implementation not really meant for browsers but meant for node.js, which weighed 70 KB (with deflate.js and crc32.js on which it depends).

The reasons why the author created lz-string:

  • Working on mobile I needed something fast.
  • Working with Strings gathered from outside my website, I needed something that can take any kind of string as an input, including any UTF characters above 255.
  • The library not taking 70 KB was a definite plus. I needed something that produces strings as compact as possible to store in localStorage. None of the libraries I could find online worked well for my needs.

There are implementations of this lib in other languages. I am currently looking into the Python implementation, but decompression seems to have issues there at the moment; if you stick to JS only, it looks really good to me.

Nils Ziehn
  • 4,118
  • 6
  • 26
  • 40
1

Try experimenting with text files before implementing anything, because I think that the following does not necessarily hold:

so I figured it would help keep the database smaller if I compressed the things it stores.

That's because lossless compression algorithms are pretty good with repeating patterns (e.g. whitespace).

cherouvim
  • 31,725
  • 15
  • 104
  • 153
  • 1
    Thanks but I don't understand your answer. The database itself in Chrome is an implementation of Sqlite and does not use any kind of compression AFAIK. It would be simpler to compress the database file as a whole, but I don't think that's possible from within a Chrome application. So I need to compress the strings before they enter the database. – Bambax Dec 31 '10 at 13:25
  • Keep in mind that in JavaScript all strings are `UTF-16`, meaning every single character weighs 16 bits. If you only use 7-bit ASCII characters, that's 9 bits wasted for every character in your string. Using a compression library that smartly occupies the 16-bit space will show a non-negligible gain. There are online demos that can be tested for this (see my answer to this question) – pieroxy May 10 '13 at 14:42
1

BWTC32Key uses a BZip-family improvement and Base32768 to get extremely high efficiency, and its optional encryption is AES256-CTR to avoid padding. Anything you want (including strings) can be fed into it, and the result will be a very efficient UTF-16 string containing the input after heavy compression (and optionally encryption after the compression but before the Base32768). I ran my 829 KiB compendium of homemade Minecraft command-block commands from eons ago through BWTC32Key, and I got a 13,078-character output string. Minecraft command blocks can go up to 32,767 characters, though some older versions of the game only allowed in-game use of strings half that size; by using MCEdit you could still hit the 32,767 limit, and this issue was soon fixed.

Anyway, 829 KiB of plain text is far larger than the 32,767-character limit, but BWTC32Key makes it fit into fewer than 16K characters. For a more extreme example, the full chemical name of the Titin protein is 189 thousand letters; I can use BWTC32Key to get it down to around 640. Even input encodings wider than 1 byte per character (like UTF-16) still give the savings.

cigien
  • 57,834
  • 11
  • 73
  • 112
stgiga
  • 21
  • 4
1

When the proposed Compression Streams web API lands in all browsers, you can run the following without importing any modules. It already works in Node applications.

Compress a string to a byte array

async function compress(inString) {
  const compressedStream = new Response(inString).body.pipeThrough(new CompressionStream('gzip'))
  const bytes = await new Response(compressedStream).arrayBuffer()
  return bytes
}

Decompress a byte array to get the original string

async function decompress(bytes) {
  const decompressedStream = new Response(bytes).body.pipeThrough(new DecompressionStream('gzip'))
  const outString = await new Response(decompressedStream).text()
  return outString
}
meedstrom
  • 341
  • 2
  • 7