
Here's my use case:

  • I want to store many small entries of about 1K into archive files of about 8M.
  • I want to be able to read individual entries efficiently (without reading the whole file).
  • I want to be able to compress the archive efficiently. In a test I performed, a TAR+ZIP archive was 4x smaller than a plain ZIP archive. This isn't surprising at all; there's not much opportunity to compress individual 1K entries.
  • I don't need to update the archive. Once created, it's immutable.

Is there any archive format which supports both (global compression + individual access)? Theoretically, the two goals are not mutually exclusive.

Note: This is for a Java project, so I am restricted to a format that also has a java library.

Bogdan Calmac
  • Duplicate of [this question](http://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives) – Javier Feb 19 '13 at 18:21
  • It's not a duplicate. I'm not talking about random access into the compressed stream but rather access to individual entries. I'll rephrase the question to eliminate this confusion. – Bogdan Calmac Feb 19 '13 at 18:42

2 Answers


I am not aware of a canned solution for your problem, so you may need to write it yourself.

It certainly can be done. I would use the tar format, since it is simple and well understood, but it would require an auxiliary file with index information into the compressed archive. You would control the compression of the tar file to create entry points that require no history. Those entry points would need to be much farther apart than 1K to enable good compression, but they can be close enough together to provide relatively fast random access to the 1K files.

The simplest approach would be to use gzip to separately compress chunks of the tar file representing sets of complete files that together are around 128K bytes. The gzip streams can be simply concatenated, and the resulting .tar.gz file would work normally with tar utilities. It is a property of the gzip format that valid gzip streams concatenated are also valid gzip streams.
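Since the question asks for Java, here is a minimal sketch of that concatenation property using nothing but `java.util.zip` (the class and method names are mine): two chunks are gzipped independently, the raw bytes are concatenated, and `GZIPInputStream` decompresses the whole thing as one stream.

```java
import java.io.*;
import java.util.zip.*;

public class GzipConcat {
    // compress one chunk as an independent gzip member
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // decompress a concatenation of gzip members back into one byte stream
    static String gunzip(byte[] compressed) throws IOException {
        GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) > 0; ) out.write(buf, 0, n);
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        byte[] a = gzip("chunk one, ".getBytes("UTF-8"));
        byte[] b = gzip("chunk two".getBytes("UTF-8"));
        byte[] both = new byte[a.length + b.length];   // plain byte concatenation
        System.arraycopy(a, 0, both, 0, a.length);
        System.arraycopy(b, 0, both, a.length, b.length);
        System.out.println(gunzip(both));              // prints "chunk one, chunk two"
    }
}
```

In the real archive, each `gzip` call would cover a ~128K run of complete tar entries rather than a short string.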

The auxiliary file would contain a list of the files in the tar archive, their size and offset in the uncompressed tar file, and then separately the compressed and uncompressed offsets of each gzip stream starting point. Then when extracting a file you would look for its offset in the uncompressed tar file, find the gzip stream with the largest uncompressed offset less than or equal to that file's offset, and then start decompressing from the corresponding compressed offset until you get to that file.

For this example, on average you would only need to decompress 64K to get to any given file in the archive.
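The "largest offset less than or equal to" lookup maps naturally onto a `TreeMap` in Java; a small sketch (all offsets below are invented for illustration):

```java
import java.util.TreeMap;

public class ChunkIndex {
    // uncompressed tar offset of each gzip member start -> its compressed offset
    private final TreeMap<Long, Long> members = new TreeMap<>();

    void addMember(long uncompressedOffset, long compressedOffset) {
        members.put(uncompressedOffset, compressedOffset);
    }

    // compressed offset of the largest member start <= the file's tar offset,
    // i.e. where to begin decompressing to reach that file
    long seekPoint(long tarOffset) {
        return members.floorEntry(tarOffset).getValue();
    }

    public static void main(String[] args) {
        ChunkIndex idx = new ChunkIndex();
        idx.addMember(0L, 0L);          // member boundaries every ~128K of tar data
        idx.addMember(131072L, 41000L); // compressed offsets are made up here
        idx.addMember(262144L, 83500L);
        System.out.println(idx.seekPoint(200000L)); // prints 41000
    }
}
```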

Mark Adler

In general, the compression table built during encoding is interspersed with the compressed data that refers to it.

If one wants to do compression oneself, one way would be:

[shared compression table]

[compression table addition specific to file 1] [file 1]
[compression table addition specific to file 2] [file 2]
...

And at the end shuffle/share compression table parts.
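A rough Java sketch of this shared-table idea using zlib preset dictionaries via `Deflater`/`Inflater` (the dictionary contents and class name here are placeholders): each entry is compressed independently, but both sides agree on a dictionary of common strings, which recovers some of the cross-file redundancy that per-entry compression normally loses.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class SharedDictionary {
    // hypothetical shared "table": common strings seen across many entries
    static final byte[] DICT = "the quick brown fox jumps over the lazy dog ".getBytes();

    // compress one entry against the shared dictionary
    static byte[] deflate(byte[] data) {
        Deflater d = new Deflater();
        d.setDictionary(DICT);              // must be set before any input
        d.setInput(data);
        d.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!d.finished()) bos.write(buf, 0, d.deflate(buf));
        d.end();
        return bos.toByteArray();
    }

    // decompress one entry; the inflater asks for the same dictionary
    static byte[] inflate(byte[] packed) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(packed);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!inf.finished()) {
            int n = inf.inflate(buf);
            if (n == 0 && inf.needsDictionary()) {
                inf.setDictionary(DICT);    // supply the shared table
            } else {
                bos.write(buf, 0, n);
            }
        }
        inf.end();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] entry = "the quick brown fox is lazy".getBytes();
        System.out.println(new String(inflate(deflate(entry)))); // round-trips the entry
    }
}
```

In a real archive the dictionary would be built from the actual entries (e.g. a sample of them) and stored once in the file header.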

Whether one would gain anything over 7zip, bzip2, and the others is the question.

BTW, Java's stream-based zip handling (ZipInputStream) does not (still?) use the index at the end of the file (the central directory); only ZipFile does.
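For what it's worth, `java.util.zip.ZipFile` does read the central directory at the end of the file to locate entries without scanning the archive; a minimal sketch (class name is mine):

```java
import java.io.*;
import java.util.zip.*;

public class ZipRandomAccess {
    // writes a two-entry zip, then reads back a single entry by name
    static String demo() throws IOException {
        File f = File.createTempFile("demo", ".zip");
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(f))) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("alpha".getBytes("UTF-8"));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("beta".getBytes("UTF-8"));
            zos.closeEntry();
        }
        // ZipFile seeks via the central directory at the end of the file, so
        // only b.txt is decompressed; ZipInputStream would scan from the start.
        String result;
        try (ZipFile zf = new ZipFile(f);
             InputStream in = zf.getInputStream(zf.getEntry("b.txt"))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[512];
            for (int n; (n = in.read(buf)) > 0; ) out.write(buf, 0, n);
            result = out.toString("UTF-8");
        }
        f.delete();
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo()); // prints "beta"
    }
}
```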

Joop Eggen
  • The compression table doesn't have to be interspersed. It could be accumulated in memory as entries are added and then written down at the end of the archive. Then when you need to read just file1, you read the compression table as well as the section of the archive that contains file1. – Bogdan Calmac Feb 19 '13 at 18:32
  • In terms of compression efficiency, I archived 341 files totaling 276K in two ways: TAR followed by ZIP (therefore doing global compression) and just ZIP (compression of individual files). The TAR+ZIP archive was 4x smaller than the ZIP archive: 52K vs 202K. – Bogdan Calmac Feb 19 '13 at 18:37
  • I would assume that you also imply that no such archive format already exists. Pity, it would be quite a neat thing. – Bogdan Calmac Feb 19 '13 at 18:52