3

I'm trying to understand how to randomly access a file or files in a .tar.gz using TrueZIP in a Java 6 environment (using the Files classes). I found examples that use Java 7's Path; however, I can't come up with an example of how to randomly read an archive on Java 6.

Additionally, does "random" reading mean that it first uncompresses the entire archive, or does it read sections of the compressed file? I want to retrieve some basic information from the file (e.g. a username) without having to uncompress the entire thing just to read it.

stan

3 Answers

3

The way gzip compresses a file (and .tar.gz files in particular) usually means that the output is not randomly accessible: you need the symbol table and other context from everything in the file up to the current block just to be able to uncompress that block and see what's in it. This is one of the ways it achieves (somewhat) better compression than ZIP/pkzip, which compresses each file individually before adding it to a container archive, so that you can seek to a specific file and uncompress just that file.

So, in order to pick a .tar.gz apart, you will need to uncompress the whole thing, either to a temporary file or in memory (if it's not too large). Then you can jump to specific entries in the underlying .tar file, although that has to be done sequentially by skipping from header to header, because tar does not include a central index/directory of files.
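The header-to-header walk described above can be sketched with nothing but the JDK, so it also runs on Java 6. This is a minimal illustration and not TrueZIP's API: the class name `TarGzScan` and method `readFirst` are made up for this example, and it assumes plain tar headers (name in bytes 0-99, octal size in bytes 124-135, 512-byte blocks). It ignores the ustar name prefix, GNU long-name entries, base-256 sizes and checksums.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: scans a .tar.gz sequentially, header by header.
public class TarGzScan {
    private static final int BLOCK = 512; // tar works in 512-byte blocks

    /** Returns the contents of the first entry named 'wanted', or null if none is found. */
    public static byte[] readFirst(InputStream tarGz, String wanted) throws IOException {
        InputStream in = new GZIPInputStream(tarGz);
        byte[] header = new byte[BLOCK];
        while (readFully(in, header)) {
            if (header[0] == 0) {
                break; // assumption: a header with an empty name marks the end of the archive
            }
            String name = field(header, 0, 100);
            long size = Long.parseLong(field(header, 124, 12).trim(), 8); // size is octal text
            if (name.equals(wanted)) {
                byte[] data = new byte[(int) size]; // assumes the entry fits in memory
                if (!readFully(in, data)) {
                    throw new EOFException("truncated archive");
                }
                return data; // stop here: the rest of the archive is never decompressed
            }
            // Entry data is padded to a multiple of 512 bytes. "Skipping" a gzip
            // stream still decompresses those bytes; they are just thrown away.
            skipFully(in, (size + BLOCK - 1) / BLOCK * BLOCK);
        }
        return null;
    }

    // Fills buf completely; returns false only on a clean end of stream before the first byte.
    private static boolean readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                if (off == 0) {
                    return false;
                }
                throw new EOFException("truncated archive");
            }
            off += n;
        }
        return true;
    }

    private static void skipFully(InputStream in, long n) throws IOException {
        byte[] scratch = new byte[4096];
        while (n > 0) {
            int r = in.read(scratch, 0, (int) Math.min(scratch.length, n));
            if (r < 0) {
                throw new EOFException("truncated archive");
            }
            n -= r;
        }
    }

    // Decodes a NUL-terminated ASCII header field.
    private static String field(byte[] header, int off, int len) throws IOException {
        int end = off;
        while (end < off + len && header[end] != 0) {
            end++;
        }
        return new String(header, off, end - off, "US-ASCII");
    }
}
```

A call might look like `TarGzScan.readFirst(new FileInputStream("backup.tar.gz"), "user.txt")`, where both names are placeholders. The cost is proportional to how far into the archive the entry sits, since everything before it has to be decompressed and discarded.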

twalberg
  • Correct. Or at least you need to uncompress up to the desired file. If you're going to be doing a lot of random access on the same .tar.gz file, then you can decompress the whole thing once and build an index of entry points. Then you can get closer to true random access, where the speed depends on the density of entry points. – Mark Adler Oct 05 '12 at 19:09
  • @MarkAdler Of course, there's also the caveat that a `tar` file can contain multiple files with the same name (but not necessarily the same contents). While this is certainly unusual to see in the wild these days, the functionality is still there, and so stopping decompression at the first instance of the file you're looking for is not always the "right thing". – twalberg Oct 05 '12 at 19:23
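The two comments above can be combined into a small sketch: decompress the archive once, record where each entry's data starts, and keep every occurrence of a name instead of stopping at the first. This is an illustration under simplifying assumptions, not TrueZIP code: the class `TarIndex` and its methods are hypothetical, the whole decompressed tar is held in memory, and only plain headers with octal sizes are handled. Note that it indexes tar entries in the decompressed data, which is simpler than indexing entry points inside the gzip stream itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: one full decompression pass, then lookups by name.
public class TarIndex {
    private static final int BLOCK = 512;
    private final byte[] tar; // the fully decompressed archive
    // name -> list of {data offset, size}, in archive order (duplicates are kept)
    private final Map<String, List<int[]>> entries = new LinkedHashMap<String, List<int[]>>();

    public TarIndex(InputStream tarGz) throws IOException {
        InputStream in = new GZIPInputStream(tarGz);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        tar = out.toByteArray();

        int pos = 0;
        // Assumption: a header whose name starts with NUL marks the end of the archive.
        while (pos + BLOCK <= tar.length && tar[pos] != 0) {
            String name = field(pos, 100);
            int size = Integer.parseInt(field(pos + 124, 12).trim(), 8); // octal text
            List<int[]> list = entries.get(name);
            if (list == null) {
                list = new ArrayList<int[]>();
                entries.put(name, list);
            }
            list.add(new int[] { pos + BLOCK, size });
            pos += BLOCK + (size + BLOCK - 1) / BLOCK * BLOCK; // header + padded data
        }
    }

    /** Number of entries stored under this name (0 if absent). */
    public int count(String name) {
        List<int[]> list = entries.get(name);
        return list == null ? 0 : list.size();
    }

    /** Contents of the given occurrence (0 = first) of the named entry. */
    public byte[] read(String name, int occurrence) {
        int[] e = entries.get(name).get(occurrence);
        return Arrays.copyOfRange(tar, e[0], e[0] + e[1]);
    }

    /** Contents of the last occurrence, which is the one a normal extraction would leave on disk. */
    public byte[] readLast(String name) {
        return read(name, count(name) - 1);
    }

    // Decodes a NUL-terminated ASCII header field starting at 'off'.
    private String field(int off, int len) throws IOException {
        int end = off;
        while (end < off + len && tar[end] != 0) {
            end++;
        }
        return new String(tar, off, end - off, "US-ASCII");
    }
}
```

After the constructor has run, each lookup is a map access plus an array copy, so repeated reads no longer depend on where the entry sits in the archive. For archives too large for memory, the same offsets could instead be applied to a temporary file holding the decompressed tar.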
1

I am not familiar with TrueZIP in particular, but at least in terms of Zip, RAR and Tar you can access single files, retrieve details about them and even extract them without touching the rest of the package.

> Additionally, does "random" reading mean that it first uncompresses the entire archive

If TrueZIP follows the Zip/RAR/Tar formats, then it does not uncompress the entire archive.

> The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it(ie username).

As above, that should be fine. I don't know the TrueZIP API in particular, but file container formats allow you to inspect file info without reading a single bit of the data, and optionally to extract/read one file's contents without touching any other file in the container.

Kai Sellgren
  • Zip, RAR and Tar files have a central table of contents, and for the first two formats each file is compressed individually. But the OP was asking about tar.gz files, which are compressed as a whole, which means that you usually have to decompress the whole archive before accessing individual files. – Robert Oct 05 '12 at 15:18
  • @Robert Actually, `tar` does not have a central directory/index. But, each file header has enough information in it to figure out where to skip to in order to get the next file header. – twalberg Oct 05 '12 at 15:22
0

The source code comment of zran describes how such tools work: http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c

In short, the complete file has to be processed once to generate the necessary index. That pass still has to run through all of the compressed data, but it does not need to keep the decompressed output. The index then splits the file into blocks that can each be decompressed without first decompressing the blocks before them, which is how random access is emulated.

Robert