I want to store web pages as compressed text in CSV files. To get the best compression, I would like to give the library a sample set of about 1000 web pages and let it spend some time building an optimal "dictionary" for this content. One obvious dictionary entry would be <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">, which is present on almost all web pages and could be stored as %1 or something similar. With a customized dictionary like this, I expect the compression rate to be around 99% in my case.
My question is: does a library that does this exist for Windows, under the MIT license or a similarly liberal license? If not, are there any general-purpose compression libraries you would recommend? I have experimented a bit with zlib, but it outputs binary data. If I convert this binary data to text, I am worried that the result might be longer than the original text.
EDIT: I need to be able to store the text in CSV files and still be able to import them into a database or even Excel.
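
To make the idea concrete, here is a minimal sketch of roughly what I have in mind, using Python's zlib bindings with a preset dictionary and Base64 to turn the binary output back into text. The names PRESET_DICT, pack and unpack are made up for illustration, and the dictionary is hand-written here rather than learned from the 1000 sample pages, so this is an assumption about how it could work, not a finished solution.

```python
import base64
import csv
import io
import zlib

# Hypothetical hand-written dictionary of strings expected on most pages.
# A real one would be derived from the sample pages instead.
PRESET_DICT = (
    b"<html><head><title></title></head><body></body></html>"
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    b'"http://www.w3.org/TR/html4/strict.dtd">'
)


def pack(page: str) -> str:
    """Compress a page with the preset dictionary and return Base64 text."""
    comp = zlib.compressobj(level=9, zdict=PRESET_DICT)
    raw = comp.compress(page.encode("utf-8")) + comp.flush()
    return base64.b64encode(raw).decode("ascii")


def unpack(cell: str) -> str:
    """Reverse pack(); the same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=PRESET_DICT)
    raw = base64.b64decode(cell)
    return (decomp.decompress(raw) + decomp.flush()).decode("utf-8")


if __name__ == "__main__":
    page = (
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">'
        "<html><head><title>Hi</title></head>"
        "<body><p>Hello</p></body></html>"
    )
    cell = pack(page)

    # Write one CSV row and read it back to check the round trip.
    buf = io.StringIO()
    csv.writer(buf).writerow(["http://example.com/", cell])
    buf.seek(0)
    url, stored = next(csv.reader(buf))

    print(len(page), "characters before,", len(stored), "after")
    print("round trip ok:", unpack(stored) == page)
```

Base64 only uses letters, digits, "+", "/" and "=", so the cell needs no CSV quoting, but it makes the compressed data about one third larger, and the cell is no longer readable text in Excel or the database.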