
I'm writing out small, variable-size frames (from 15 bytes to maybe 4 KB each) of real-time data to a file. The total size of the data can run into the tens of gigabytes, so I want to compress it.

But when reading from the file, I want to be able to seek inside the data without decompressing everything up to the point of interest. Ideally there would be entry points spaced at e.g. 1 MB intervals that I could jump to, read the timestamp of the next frame in the compressed data, and start decompressing from there, without first having to decompress everything from the start of the file.

But I do not want to implement a whole compression algorithm for this. If the resulting file also turned out to be compatible with a widely used format like gzip, that would be great, but ease of implementation (and not degrading the compression ratio too much) is more important.

To stay compatible with gzip, I could use multiple gzip members, about 1 MB each, and put the timestamp of the next frame into a (per-member) extra field. The downside, if I'm not mistaken, is that the deflate dictionary is discarded at the start of every member. (Though I don't know whether carrying the dictionary over or starting fresh for every member costs fewer bytes / CPU cycles.)
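For illustration, here is a minimal sketch (in C#) of what such a member header could look like per RFC 1952. The "TS" subfield ID and the 8-byte timestamp payload are conventions I would have to define myself; the gzip spec only standardizes the subfield container:

    using System;
    using System.IO;

    static class GzipMemberHeader
    {
        // "TS" is a subfield ID of my own choosing; RFC 1952 only defines
        // the SI1/SI2/LEN container, not what goes into it.
        const byte Si1 = (byte)'T', Si2 = (byte)'S';

        // Writes the 10 fixed header bytes with FEXTRA set, then an extra
        // field carrying the timestamp of the first frame in this member.
        public static void Write(Stream s, long firstFrameTimestamp)
        {
            // 8 bytes; BitConverter is little-endian on the platforms I care about
            byte[] ts = BitConverter.GetBytes(firstFrameTimestamp);
            byte[] header =
            {
                0x1f, 0x8b,  // gzip magic
                8,           // CM = deflate
                0x04,        // FLG: FEXTRA
                0, 0, 0, 0,  // MTIME (unset)
                0,           // XFL
                0xff,        // OS = unknown
                12, 0,       // XLEN = subfield ID (2) + LEN (2) + payload (8)
                Si1, Si2,    // subfield ID
                8, 0         // subfield LEN = 8, little-endian
            };
            s.Write(header, 0, header.Length);
            s.Write(ts, 0, ts.Length);
        }
    }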

There are various solutions that address the "random read access" issue in Java and other languages, but I could not find one that runs on the CLR.

One further requirement is that compression must happen in a streaming fashion, i.e. I cannot write an index at the end once compression is finished. So the entry points need to either be at predefined positions, or the information needed to find them has to be interleaved with the compressed data, so that whatever has already been written to disk remains usable even if the process is killed or crashes.

Neither .NET's GZipStream nor SharpZipLib provides hooks out of the box to help with this.

Ideas?

Evgeniy Berezovsky
  • Given the size of the frames, I think it would be better to just use filesystem compression on a folder. – leppie Mar 04 '15 at 06:13
  • @leppie The size of the frames should be irrelevant to the solution. What matters is the interval of the entry points, which I tentatively set at 1 MB. But more importantly, the result is supposed to be a compressed file, i.e. something that I can copy to other machines, send via email etc., and which retains the compression. – Evgeniy Berezovsky Mar 04 '15 at 06:20

2 Answers


You have already found plenty of good approaches out there. A sequence of gzip members is a perfectly fine solution, where each member carries an extra field giving the length of that member, so that you can skip members. There is some small compression loss, but that cannot be avoided if you want to be able to start decompressing at specified points in the stream without the preceding decompressed data. You can reduce that impact by making the members larger.
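A sketch of such member skipping, assuming the writer stores each member's total compressed length (header and trailer included) as the first 4 bytes of the extra subfield's payload; that layout is an assumed convention, not something the gzip spec prescribes:

    using System;
    using System.IO;

    static class GzipMemberSkipper
    {
        // Skips `count` whole gzip members without decompressing them.
        public static void Skip(Stream s, int count)
        {
            var r = new BinaryReader(s);
            for (int i = 0; i < count; i++)
            {
                byte[] fixedPart = r.ReadBytes(10); // magic, CM, FLG, MTIME, XFL, OS
                if (fixedPart.Length < 10 || fixedPart[0] != 0x1f || fixedPart[1] != 0x8b
                    || (fixedPart[3] & 0x04) == 0)
                    throw new InvalidDataException("expected a gzip member with FEXTRA");

                ushort xlen = r.ReadUInt16();     // little-endian, per RFC 1952
                byte[] extra = r.ReadBytes(xlen); // SI1, SI2, LEN (2), then payload

                // Assumed convention: the payload starts with the member's
                // total compressed length, header and trailer included.
                uint memberLen = BitConverter.ToUInt32(extra, 4);

                // Jump past the rest of the member (we already consumed
                // 10 + 2 + xlen bytes of it).
                s.Seek(memberLen - (10 + 2 + xlen), SeekOrigin.Current);
            }
        }
    }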

Mark Adler
  • It would also mean I'd have to buffer the complete (compressed) member, so that I know the size to write into the extra field (see the sketch after these comments). For .NET, it also means extending an existing library so that I can write and read extra fields, which is a route I might take in the end. – Evgeniy Berezovsky Mar 04 '15 at 06:57
  • Yes. But you're talking about small members. Are you really so constrained that you can't have a few MB of buffer for the compressed data? – Mark Adler Mar 04 '15 at 06:58
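A rough sketch of the buffering discussed in these comments: compress each member into a MemoryStream first, then emit the header, whose extra field can now carry the known member length (here followed by the timestamp, extending the header layout sketched in the question):

    using System;
    using System.IO;
    using System.IO.Compression;

    static class BufferedMemberWriter
    {
        // Compresses one member into memory first, so that its compressed
        // size is known before the header and extra field go out to disk.
        public static void WriteMember(Stream output, byte[] frames, long timestamp)
        {
            using (var buffer = new MemoryStream())
            {
                // leaveOpen: true, so disposing the DeflateStream flushes
                // the final deflate block without closing the buffer.
                using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, true))
                    deflate.Write(frames, 0, frames.Length);

                // Total member size: 28 header bytes (10 fixed + 2 XLEN +
                // 16 extra field), the deflate body, the 8-byte trailer.
                uint memberLen = 28 + (uint)buffer.Length + 8;
                WriteMemberHeader(output, memberLen, timestamp);
                buffer.Position = 0;
                buffer.CopyTo(output);
                // The CRC-32/ISIZE trailer must still follow (omitted here).
            }
        }

        // Like the header sketch in the question, but the subfield payload
        // is the 4-byte member length followed by the 8-byte timestamp.
        static void WriteMemberHeader(Stream s, uint memberLen, long timestamp)
        {
            byte[] header =
            {
                0x1f, 0x8b, 8, 0x04, 0, 0, 0, 0, 0, 0xff, // fixed part, FEXTRA set
                16, 0,                                     // XLEN = 2 + 2 + 12
                (byte)'T', (byte)'S', 12, 0                // subfield ID and LEN
            };
            s.Write(header, 0, header.Length);
            s.Write(BitConverter.GetBytes(memberLen), 0, 4);
            s.Write(BitConverter.GetBytes(timestamp), 0, 8);
        }
    }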

In order to implement custom gzip features that GZipStream lacks (when I tried it, it did not even decompress multiple gzip members, although gunzip does), it is not actually necessary to reimplement the compression algorithm. GZipStream uses DeflateStream under the hood and just adds the header and CRC, so I only need to implement my own GZipStream that uses DeflateStream to do the compression for me, which seems straightforward.
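As a sketch of that approach (assumptions: raw deflate output from DeflateStream, plus a hand-rolled bit-by-bit CRC-32, since the BCL has none; SharpZipLib's table-driven Crc32 class would be a faster substitute), a single gzip member could be written like this:

    using System;
    using System.IO;
    using System.IO.Compression;

    static class MyGzipWriter
    {
        // One gzip member: fixed header, raw deflate data from
        // DeflateStream, then the RFC 1952 trailer (CRC-32 and length of
        // the uncompressed data, both little-endian). An extra field with
        // timestamp/length would go into the header as sketched above.
        public static void WriteMember(Stream output, byte[] data)
        {
            output.Write(new byte[] { 0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 0, 0xff }, 0, 10);

            // leaveOpen: true keeps the underlying stream open so the
            // trailer can be written after the deflate data is flushed.
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, true))
                deflate.Write(data, 0, data.Length);

            output.Write(BitConverter.GetBytes(Crc32(data)), 0, 4);
            output.Write(BitConverter.GetBytes((uint)data.Length), 0, 4); // ISIZE mod 2^32
        }

        // Minimal bit-by-bit CRC-32 (slow but dependency-free).
        static uint Crc32(byte[] data)
        {
            uint crc = 0xffffffff;
            foreach (byte b in data)
            {
                crc ^= b;
                for (int i = 0; i < 8; i++)
                    crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xedb88320 : crc >> 1;
            }
            return ~crc;
        }
    }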

Thanks to Mark Adler for confirming my speculation about the gzip spec. So the solution will be straightforward, and the result compatible with the gzip specification.

Evgeniy Berezovsky