
I have raw data in text-file format with a lot of repetitive tokens (~25%). I would like to know if there's an algorithm that will (A) store the data in a compact form, yet (B) allow the original file to be reconstituted at run time.

Any ideas?

More details:

  • the raw data is consumed in a pure HTML+JavaScript app, for instant search using regexes.
  • the data is made of tokens containing (case-sensitive) alpha characters, plus a few punctuation symbols.
  • tokens are separated by spaces and newlines.

Most promising algorithm so far: the succinct data structures discussed at the links below, but reconstituting the original data looks difficult.

http://stevehanov.ca/blog/index.php?id=120

http://ejohn.org/blog/dictionary-lookups-in-javascript/

http://ejohn.org/blog/revised-javascript-dictionary-search/

PS: Server-side gzip is employed right now, but it's only a transport-layer optimization; it doesn't help maximize the use of offline storage, for example. Given the massive 25% repetitiveness, shouldn't it be possible to store the data in a more compact way?
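To make the idea concrete, here is a rough sketch (illustrative only, not the succinct-structure approach) of the naive direction: build a dictionary of the repeated tokens, substitute short codes for them, and expand the codes back at load time. The `\u0001`/`\u0002` delimiters are an assumption that those bytes never occur in the data:

```javascript
function compress(text) {
  // Split on whitespace but keep the separators, so the original
  // spacing/newlines survive the round trip.
  var parts = text.split(/(\s+)/);
  var counts = Object.create(null); // avoid prototype-key collisions
  parts.forEach(function (tok) {
    if (!/^\s*$/.test(tok)) counts[tok] = (counts[tok] || 0) + 1;
  });
  // Repeated tokens only, most frequent first; each gets a short code.
  var dict = Object.keys(counts)
    .filter(function (t) { return counts[t] > 1; })
    .sort(function (a, b) { return counts[b] - counts[a]; });
  var codeOf = Object.create(null);
  dict.forEach(function (t, i) {
    codeOf[t] = "\u0001" + i.toString(36) + "\u0002";
  });
  var body = parts.map(function (tok) { return codeOf[tok] || tok; }).join("");
  return { dict: dict, body: body };
}

function decompress(packed) {
  // Reconstitute the original text by expanding each code to its token.
  return packed.body.replace(/\u0001([0-9a-z]+)\u0002/g, function (_, i) {
    return packed.dict[parseInt(i, 36)];
  });
}
```

(A real implementation would only substitute tokens whose code is shorter than the token itself; this is just to illustrate the reconstitution requirement.)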

Algo
  • what about using server-side gzip compression? – Parzifal May 14 '12 at 15:53
  • You can use [deflate/inflate](http://stackoverflow.com/questions/2233062/javascript-deflate-implementation), which should be perfect for that kind of task. – user123444555621 May 14 '12 at 16:00
  • How would the data be consumed? Will it be loaded from the server using an AJAX request? – Ja͢ck May 14 '12 at 16:00
  • I am not really sure what you are trying to achieve, but some JavaScript obfuscators are meant to compress the data. If you are in the mood for adventure, you could try implementing an LZ compression algorithm in JavaScript. – Bakudan May 14 '12 at 16:03
  • ack! apologies for not giving details on the actual use. It's to allow real-time search using regexes. gzip is enabled, and helps immensely. I have already played with jsunzip, encoding/decoding data from a .png to a canvas. I have a nagging feeling, though, that there's an algorithm suited exactly to this (compact representation + reconstitution on the fly) requirement. – Algo May 14 '12 at 16:26
  • The point of succinct data structures is that you don't have to reconstitute it. They help find a way to use it in compressed form. – Steve Hanov Nov 05 '14 at 19:35

2 Answers


Given that the actual use is pretty unclear, I have no idea whether this is helpful or not, but for the smallest total size (HTML + JavaScript + data), some people came up with the idea of storing text data in a grayscale .png file, one byte per pixel. A small loader script can then draw the .png to a canvas, read it pixel by pixel, and reassemble the original data that way. This gives you deflate compression without having to implement it in JavaScript. See e.g. here for more detailed information.
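For illustration only, the loader side might look roughly like this. It assumes the server wrote one ASCII byte per grayscale pixel into a hypothetical `data.png` and padded the image with zero bytes; note too that browser color management can subtly alter the pixel values you read back, which is yet another reason to avoid this in practice:

```javascript
var img = new Image();
img.onload = function () {
  var canvas = document.createElement("canvas");
  canvas.width = img.width;
  canvas.height = img.height;
  var ctx = canvas.getContext("2d");
  ctx.drawImage(img, 0, 0);
  var pixels = ctx.getImageData(0, 0, img.width, img.height).data;
  var chars = [];
  // getImageData returns RGBA; in a grayscale image R === G === B,
  // so read every 4th byte and skip the zero padding.
  for (var i = 0; i < pixels.length; i += 4) {
    if (pixels[i] !== 0) chars.push(String.fromCharCode(pixels[i]));
  }
  var text = chars.join("");
  // ... hand `text` to the rest of the app ...
};
img.src = "data.png";
```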

Please do not use a technique like that unless you have pretty esoteric requirements, e.g. a size-constrained programming competition. Your coworkers will thank you :-)

Medo42
  • Don't do this, even with esoteric requirements. The time spent unpacking the data (6s for 250k) is potentially longer than transferring it uncompressed, and implementing *any* kind of compression in JavaScript is simply silly since JavaScript is single-threaded. – josh3736 May 14 '12 at 16:38
  • ack! apologies for not giving details on the actual use. It's to allow real-time search using regexes. gzip is enabled, and helps immensely. I have already played with jsunzip, encoding/decoding data from a .png to a canvas, etc. I have a nagging feeling, though, that there's an algorithm suited exactly to this (compact representation + reconstitution on the fly) requirement. Any ideas? – Algo May 14 '12 at 16:45
  • @josh3736 For some uses this is perfectly acceptable, though the one I mentioned is the only one I know of: if you want to get as much as possible into 4kb for a competition, anything goes :P Now that we know more about the problem, I agree that this idea is useless here. – Medo42 May 14 '12 at 17:47

Generally speaking, it's a bad idea to try to implement compression in JavaScript. Compression is the exact type of work that JS is the worst at: CPU-intensive calculations.

Remember that JS is single-threaded1, so for the entire time spent decompressing data, you block the browser UI. In contrast, HTTP gzipped content is decompressed by the browser asynchronously.

Given that you have to reconstruct the entire dataset (so as to test every record against a regex), I doubt the Succinct Trie will work for you. To be honest, I doubt you'll get much better compression than the native gzipping.


1 - Web Workers notwithstanding.
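If you do end up decompressing in JavaScript anyway, at least do it in a Worker so the UI stays responsive. A minimal sketch, where `inflate.js` and its `inflate()` function stand in for whichever JavaScript deflate/inflate implementation you pick (see the link in the comments above):

```javascript
// worker.js (runs off the main thread)
importScripts("inflate.js"); // stand-in name for your chosen library
self.onmessage = function (e) {
  // inflate() is a stand-in for the library's decompression entry point
  self.postMessage(inflate(e.data));
};

// main page
var worker = new Worker("worker.js");
worker.onmessage = function (e) {
  var text = e.data; // fully decompressed; the UI was never blocked
  // ... run the regex search over `text` ...
};
worker.postMessage(compressedData);
```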

josh3736
  • agree about decompression & JavaScript's single-threadedness being poor partners; that's why I was unimpressed with jsunzip. One BIG motivator here: gzip only gives a transport-level optimization. Once the raw data reaches the client side, its size balloons back up to the original. Think about offline storage: you are trying to maximize it. At least conceptually, the entropy of this data (i.e., this much repetition) seems to suggest we can store it in a much more compact format. Or does it? – Algo May 14 '12 at 18:26