I have raw data in a plain-text file with a lot of repetitive tokens (~25% of the content is repeated). I would like to know if there's an algorithm which will help: (A) store the data in a compact form, yet (B) allow the original file to be reconstituted at run time (a naive sketch of what I mean follows the details below).
Any ideas?
More details:
- the raw data is consumed in a pure HTML+JavaScript app, for instant search using regexes.
- the data is made of tokens containing (case-sensitive) alphabetic characters, plus a few punctuation symbols.
- tokens are separated by spaces and newlines.
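
To make (A) and (B) concrete, here's a naive dictionary-substitution sketch of the direction I'm imagining (TypeScript; all names are mine, not from any library). It stores each unique token once in a dictionary, represents the file as a list of indices into that dictionary, and reconstitutes the original text byte-for-byte on decode:

```typescript
// Encode: store each unique token once, and the file as indices into
// that dictionary. Splitting on a captured whitespace group keeps the
// separators as tokens too, so decoding is lossless.
function encode(text: string): { dict: string[]; ids: number[] } {
  const tokens = text.split(/(\s+)/);
  const dict: string[] = [];
  const seen = new Map<string, number>();
  const ids = tokens.map((tok) => {
    let id = seen.get(tok);
    if (id === undefined) {
      id = dict.length;
      seen.set(tok, id);
      dict.push(tok);
    }
    return id;
  });
  return { dict, ids };
}

// Decode: map each index back to its token to reconstitute the file.
function decode(dict: string[], ids: number[]): string {
  return ids.map((id) => dict[id]).join("");
}
```

Regex search would then run over the decoded text, or over `dict` alone for token-level queries. Whether `dict` plus `ids` (as JSON, or a packed typed array) actually ends up smaller than the gzipped original is part of what I'm unsure about.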
Most promising algorithm so far: the succinct data structures discussed in the links below, but reconstituting the original file from them looks difficult, since they encode the set of tokens rather than their order.
http://stevehanov.ca/blog/index.php?id=120
http://ejohn.org/blog/dictionary-lookups-in-javascript/
http://ejohn.org/blog/revised-javascript-dictionary-search/
PS: server-side gzip is being employed right now, but it's only a transport-layer optimization, and it doesn't help maximize the use of offline storage, for example. Given the massive 25% repetitiveness, it should be possible to store the data in a more compact way, shouldn't it?
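
For the offline-storage angle, one option I'm considering is keeping the gzipped bytes in storage and inflating them only at run time. A sketch, assuming a modern browser with the built-in DecompressionStream API (the URL and the storage step are placeholders):

```typescript
// Fetch the gzipped file as raw bytes (served without Content-Encoding,
// so the browser does not inflate it in transit), keep the compressed
// blob for offline storage, and decompress only when the app needs it.
async function loadCompressed(url: string): Promise<Blob> {
  const res = await fetch(url);
  return res.blob(); // compressed bytes; these would go into IndexedDB
}

async function reconstitute(compressed: Blob): Promise<string> {
  const inflated = compressed
    .stream()
    .pipeThrough(new DecompressionStream("gzip"));
  return new Response(inflated).text();
}

// Usage (hypothetical path):
// const blob = await loadCompressed("/data/tokens.txt.gz");
// const original = await reconstitute(blob);
```

That keeps storage as compact as the gzip stream itself, at the cost of decompressing before each search session.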