I'm writing out small, variable-size frames (from 15 to maybe 4k bytes each) of real-time data to a file. The total size of the data can reach tens of gigabytes, so I want to compress it.
But when reading from the file, I want to be able to seek inside the data without having to decompress everything up to the point of interest. It would be great if there were entry points spaced at, e.g., 1 MB intervals that I could jump to, read the timestamp of the next frame in the compressed data, and start decompressing from there instead of from the beginning of the file.
But I do not want to implement a whole compression algorithm for this. If the resulting file also turned out to be compatible with a widely used format like gzip, that would be great, but ease of implementation (and not degrading the compression ratio too much) is more important.
One way to stay compatible with gzip would be to write multiple gzip members, about 1 MB each, and put the timestamp of the member's first frame into a per-member extra field. The downside, if I'm not mistaken, is that the dictionary is discarded at the start of every member. (Though I don't know whether carrying the dictionary over or starting fresh for every member costs more bytes / CPU cycles.)
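Roughly what I have in mind for the writing side, assuming I build the member header myself per RFC 1952 and let DeflateStream produce the raw body. This is just an untested sketch; the class/method names and the "ts" subfield id are made up for illustration:

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Rough sketch only -- names are made up, and I haven't tested this.
static class SeekableGzipWriter
{
    // Writes one self-contained gzip member whose FEXTRA field carries the
    // timestamp of the first frame inside it (subfield id "ts" is an arbitrary pick).
    public static void WriteMember(Stream output, byte[] uncompressed, long firstTimestamp)
    {
        byte[] ts = BitConverter.GetBytes(firstTimestamp);  // 8 bytes, little-endian on CLR platforms
        int xlen = 2 + 2 + ts.Length;                       // SI1+SI2, subfield LEN, payload

        // --- fixed gzip member header (RFC 1952), FEXTRA flag set ---
        output.Write(new byte[] {
            0x1f, 0x8b,        // magic
            0x08,              // CM = deflate
            0x04,              // FLG = FEXTRA
            0, 0, 0, 0,        // MTIME (unset)
            0,                 // XFL
            0xff               // OS = unknown
        }, 0, 10);

        // --- extra field: XLEN, then one subfield "ts" holding the timestamp ---
        output.Write(new byte[] { (byte)xlen, (byte)(xlen >> 8) }, 0, 2);
        output.Write(new byte[] { (byte)'t', (byte)'s', (byte)ts.Length, 0 }, 0, 4);
        output.Write(ts, 0, ts.Length);

        // --- raw deflate body; each member would hold ~1 MB worth of frames ---
        using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
            deflate.Write(uncompressed, 0, uncompressed.Length);

        // --- trailer: CRC32 and length (mod 2^32) of the uncompressed data ---
        output.Write(BitConverter.GetBytes(Crc32(uncompressed)), 0, 4);
        output.Write(BitConverter.GetBytes((uint)uncompressed.Length), 0, 4);
        output.Flush();        // the member is complete on disk even if we crash later
    }

    static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFF;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int i = 0; i < 8; i++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return crc ^ 0xFFFFFFFF;
    }
}
```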
There are various solutions that address the "random read access" issue in Java and other languages, but I could not find one that runs on the CLR.
One further requirement is that compression must happen in a streaming fashion, i.e. I cannot write an index at the end once compression is finished. The entry points therefore need to either be predefined, or the information needed to reach them has to be interleaved with the compressed data, so that whatever has already been written to disk remains usable even if the process is killed or crashes.
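For reading, assuming the member layout sketched above and that I somehow know the byte offset of an entry point (predefined positions, or offsets recorded while writing), I imagine something like this. Again just an untested sketch with made-up names:

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Sketch: assumes the writer layout above (FEXTRA subfield "ts" holding an
// 8-byte timestamp) and that the caller already knows the offset of a member start.
static class SeekableGzipReader
{
    public static (long timestamp, Stream frames) OpenMemberAt(Stream file, long memberOffset)
    {
        file.Position = memberOffset;
        var br = new BinaryReader(file);

        byte id1 = br.ReadByte(), id2 = br.ReadByte(), cm = br.ReadByte(), flg = br.ReadByte();
        if (id1 != 0x1f || id2 != 0x8b || cm != 8)
            throw new InvalidDataException("Not a gzip member header at this offset.");
        br.ReadUInt32();               // MTIME
        br.ReadByte(); br.ReadByte();  // XFL, OS

        long timestamp = 0;
        if ((flg & 0x04) != 0)         // FEXTRA present
        {
            ushort xlen = br.ReadUInt16();
            byte[] extra = br.ReadBytes(xlen);
            // walk the subfields, looking for our "ts" id
            for (int p = 0; p + 4 <= extra.Length; )
            {
                int len = extra[p + 2] | (extra[p + 3] << 8);
                if (extra[p] == 't' && extra[p + 1] == 's' && len == 8)
                    timestamp = BitConverter.ToInt64(extra, p + 4);
                p += 4 + len;
            }
        }
        // FNAME/FCOMMENT/FHCRC are assumed absent because our writer never sets them.

        // The raw deflate body starts here; decompress only this member.
        return (timestamp, new DeflateStream(file, CompressionMode.Decompress, leaveOpen: true));
    }
}
```

Since every member is written out completely before the next one starts, a partially written file should still be readable member-by-member up to the crash point.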
Neither .NET's GZipStream nor SharpZipLib provides hooks out of the box to help with this.
Ideas?