
I want to store web pages in compressed text files (CSV). To get the best possible compression, I would like to supply a sample set of 1000 web pages and let the library spend some time building an optimal "dictionary" for this content. One obvious dictionary entry would be <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">, which could be stored as %1 or something like that, because it is present on almost all web pages. With a customized dictionary like this, I expect the compression rate to be around 99% in my case.

My question is: does a library for doing this exist on Windows with an MIT or similarly liberal license? If not, are there any general-purpose compression libraries you would recommend? I have experimented a bit with zlib, but it outputs binary data. If I convert this binary data into text, I am worried that the result might be longer than the original text.

EDIT: I need to be able to store the text in CSV files and still be able to import them into a database or even Excel.

David
  • What's the programming language? Google for Huffman Compression Library. Have a look at [libhuffman](http://huffman.sourceforge.net/) – sled Mar 07 '11 at 13:19
  • I am looking for a DLL so I guess it should be written in C++ or similar. – David Mar 07 '11 at 13:30
  • I have Googled around, but the only such DLLs I found are libraries written for educational purposes. – David Mar 07 '11 at 19:06
  • How fine-grained do you want the compression? In other words, if you put all 1000 web pages in a single CSV file and you want to pull the last byte of the last page from it, is it OK to start decompressing from the first byte in the first file and go through to the end? Or do you not have enough time to do that, and so need to start decompressing from the start of the last file, or perhaps the last line in the last file? – David Cary Mar 08 '11 at 01:15

1 Answer

  1. "Text files (not binary)" is a little too general. If you mean that some byte values (00, 1A, or whatever) can't be used, then any binary method plus something like base64 coding can be used (although I'd suggest a more efficient method from the Coroutine demo source).

    To be specific, you can use any general-purpose compressor to compress your base file, then the base file + target file, and then diff the two outputs. That gives you a (binary) dictionary compression, which can then be converted to "text" with base64, yEnc, or whatever.

    Alternatively, there are some coders with built-in support for that, for example:
    http://compression.ru/ds/ppmtrain.rar
    http://code.google.com/p/lzham/

  2. If you actually want common phrases replaced with references and everything else left untouched (which is kind of implied, but is not the same thing as "text output"), you can use text preprocessors like:
    http://xwrt.sourceforge.net/
    http://compression.ru/ds/liptify.rar (there were more, as far as I remember).

  3. A hybrid method is also possible. You can use a general-purpose LZ compressor as in [1], for example LZMA, and then replace its entropy coding with something text-based. For example, http://nishi.dreamhosters.com/u/lzmarec_v1_bin.rar contains a utility which removes LZMA's entropy coding, and it's pretty easy to convert its output to text.
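
As a rough illustration of point 1, here is a minimal Python sketch of "binary dictionary compression + base64". Note that it swaps in zlib's preset-dictionary feature (the `zdict` argument) for the compress-and-diff trick described above; the effect is similar, but it is not the same procedure. `SHARED_DICT`, `compress_to_text`, and `decompress_from_text` are hypothetical names, and the dictionary contents are only a guess at typical boilerplate. In practice the dictionary would be derived from the 1000 sample pages.

```python
import base64
import zlib

# Hypothetical shared dictionary (an assumption, not derived from real data).
# zlib recommends placing the most common strings towards the END of the
# dictionary, so the DOCTYPE from the question goes last.
SHARED_DICT = (
    b'</body></html><head><title></title></head><body>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    b'"http://www.w3.org/TR/html4/strict.dtd">'
)


def compress_to_text(page: str, zdict: bytes = SHARED_DICT) -> str:
    """Deflate the page against the preset dictionary, then base64-encode."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    raw = comp.compress(page.encode("utf-8")) + comp.flush()
    return base64.b64encode(raw).decode("ascii")


def decompress_from_text(text: str, zdict: bytes = SHARED_DICT) -> str:
    """Reverse compress_to_text; the SAME dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict)
    raw = decomp.decompress(base64.b64decode(text)) + decomp.flush()
    return raw.decode("utf-8")


if __name__ == "__main__":
    page = (
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">\n'
        '<html><head><title>Hi</title></head>'
        '<body><p>Hello</p></body></html>'
    )
    packed = compress_to_text(page)
    print(len(page), "chars ->", len(packed), "chars of base64 text")
    assert decompress_from_text(packed) == page
```

Base64 adds roughly one third on top of the compressed size, so the text form only ends up longer than the original when the data barely compresses at all. The dictionary has to be stored once, outside the CSV, and must never change after data has been written with it.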

Shelwien
  • Thank you for your excellent answer. To clarify, I need to be able to store the text in CSV files and still be able to import them into a database or even Excel. This means that some columns in the CSV file might be compressed and some won't be. I hope this clarifies things enough. – David Mar 07 '11 at 23:00
  • Then you need to find out which symbols you can't use in CSV and add the rest to the init string in http://nishi.dreamhosters.com/u/marc_v1.rar, then use any normal compression library. – Shelwien Mar 08 '11 at 00:39
  • Yes, any arbitrary binary [compression algorithm](http://en.wikibooks.org/wiki/Data_Compression/Refereneces#open-source_example_code) (say, zlib) and any arbitrary [binary-to-text encoding](http://en.wikipedia.org/wiki/binary-to-text_encoding) (say, base64 encoding or basE91 encoding) sounds like it might meet your criteria ... – David Cary Mar 08 '11 at 01:11
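
The "zlib plus base64" combination from the last comment can be sketched as a CSV round trip in Python. This is only an illustration under stated assumptions: the column names `url` and `html_b64` and the helpers `pack`, `unpack`, `write_rows`, and `read_rows` are hypothetical. The standard base64 alphabet contains no commas, quotes, or line breaks, so the compressed column needs no CSV escaping, while other columns stay as plain text.

```python
import base64
import csv
import io
import zlib


def pack(text: str) -> str:
    """Compress text with zlib and return it as base64 (CSV-safe ASCII)."""
    return base64.b64encode(zlib.compress(text.encode("utf-8"), 9)).decode("ascii")


def unpack(cell: str) -> str:
    """Reverse pack(): base64-decode, then inflate."""
    return zlib.decompress(base64.b64decode(cell)).decode("utf-8")


def write_rows(rows) -> str:
    """Write (url, html) pairs as CSV; only the html column is compressed."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    writer.writerow(["url", "html_b64"])  # hypothetical column names
    for url, html in rows:
        writer.writerow([url, pack(html)])
    return buf.getvalue()


def read_rows(csv_text: str):
    """Read the CSV back and decompress the html column."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return [(row["url"], unpack(row["html_b64"])) for row in reader]


if __name__ == "__main__":
    pages = [
        ("http://example.com/a", '<html><body>He said "hi", twice</body></html>'),
        ("http://example.com/b", "<html>\n<body>two\nlines</body></html>"),
    ]
    text = write_rows(pages)
    assert read_rows(text) == pages
    print(text)
```

One caveat if Excel is a target: it limits a single cell to 32,767 characters, so very large pages may not survive an import even when compressed.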