2

I have a very large directory tree of gzipped files for which I need to calculate the total uncompressed size. Since it holds more than 600GB of compressed data, I don't think decompressing each file just to measure it is the right approach.

In a Unix shell I can do this easily with gzip -l, which lists each file in a folder with its compression ratio, compressed size, and uncompressed size.

However, the Java libraries I found for GZIP only provide streams for compression and decompression.

If the gzip command can retrieve this information without decompressing the file, I assume the data must be stored in some sort of header in the file. How can I access this information without decompressing the file?

Mike G
Netto

2 Answers

3

According to the GZIP spec, RFC 1952, the last 4 bytes of a GZIP block (the RFC calls it a "member") hold the uncompressed size of the data, modulo 2^32. This field is named ISIZE and is stored in little endian. Most gzipped files contain only 1 block, so that would be the last 4 bytes of the file.

For example, I just gzipped a file whose uncompressed size was 29963246 bytes. The last 4 bytes in the gzip file are

EE 33 C9 01

which, when read little endian (right to left), is 0x01C933EE = 29963246.
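The arithmetic above can be checked in isolation. This is a minimal sketch; the class and method names (`IsizeDecodeExample`, `decodeIsize`) are made up for illustration and are not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class IsizeDecodeExample {
    /** Interprets four trailer bytes as an unsigned little-endian 32-bit value. */
    static long decodeIsize(byte[] lastFourBytes) {
        int signed = ByteBuffer.wrap(lastFourBytes)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        // Widen to long without sign extension so values >= 2^31 stay positive.
        return signed & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        // The four bytes quoted above, in file order.
        byte[] trailer = {(byte) 0xEE, 0x33, (byte) 0xC9, 0x01};
        System.out.println(decodeIsize(trailer)); // prints 29963246
    }
}
```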

Here's a quick and dirty way to get the size of the uncompressed file by only reading the last 4 bytes in little endian:

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

File f = ...
try (RandomAccessFile ra = new RandomAccessFile(f, "r");
     FileChannel channel = ra.getChannel()) {

    // map only the last 4 bytes of the file (the size field)
    MappedByteBuffer fileBuffer = channel.map(FileChannel.MapMode.READ_ONLY, f.length() - 4, 4);
    fileBuffer.order(ByteOrder.LITTLE_ENDIAN);

    // getInt() reads the 4 bytes as a signed int, so a size
    // between 2GB and 4GB comes back negative;
    // Integer.toUnsignedLong (Java 8+) converts it to the unsigned value
    long size = Integer.toUnsignedLong(fileBuffer.getInt());

    // will print the uncompressed size
    System.out.println(size);
}
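Since the question is about a whole directory tree, the same trailer read can be applied to every file and summed. The following is only a sketch under stated assumptions: every file ending in `.gz` is a single-block gzip file under 4GB uncompressed, and the names `GzipTreeSize`, `isize`, and `totalIsize` are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

class GzipTreeSize {
    /** Reads the last 4 bytes of one file as an unsigned little-endian int. */
    static long isize(Path gz) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(gz.toFile(), "r")) {
            // A 10-byte header plus an 8-byte trailer is the smallest conceivable layout.
            if (raf.length() < 18) {
                throw new IOException("too short to be a gzip file: " + gz);
            }
            raf.seek(raf.length() - 4);
            // readInt() is big-endian, so reverse the bytes to get little-endian.
            return Integer.toUnsignedLong(Integer.reverseBytes(raf.readInt()));
        }
    }

    /** Sums the trailer value of every regular *.gz file below root. */
    static long totalIsize(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".gz"))
                    .mapToLong(p -> {
                        try {
                            return isize(p);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    })
                    .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java GzipTreeSize <root directory>
        System.out.println(totalIsize(Paths.get(args[0])));
    }
}
```

Only 4 bytes per file are read, so the cost is dominated by walking the tree rather than by the 600GB of data.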

EDIT

Note this only works for a gzipped file that consists of only 1 gzipped block and whose uncompressed size is < 4GB. If you have a file with multiple gzipped blocks, this will only return the size of the last block. Also, since the spec only allots 4 bytes for the size and defines it as the size modulo 2^32, the value wraps around for data of 4GB or more; nothing in the spec requires such a file to be split into multiple GZIP blocks.
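The multi-block limitation is easy to reproduce in memory. This sketch concatenates two independently gzipped strings (a valid gzip file according to RFC 1952) and reads the last 4 bytes; the class and helper names are invented for the demonstration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class MultiMemberTrailerDemo {
    /** Compresses text into one complete gzip block. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.US_ASCII));
        }
        return out.toByteArray();
    }

    /** Joins two byte arrays, like `cat a.gz b.gz` would. */
    static byte[] concat(byte[] a, byte[] b) {
        byte[] joined = new byte[a.length + b.length];
        System.arraycopy(a, 0, joined, 0, a.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return joined;
    }

    /** Reads the last 4 bytes as an unsigned little-endian int. */
    static long trailerSize(byte[] gzipBytes) {
        return ByteBuffer.wrap(gzipBytes, gzipBytes.length - 4, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt() & 0xFFFFFFFFL;
    }

    public static void main(String[] args) throws IOException {
        byte[] both = concat(gzip("hello"), gzip("wide world"));
        // The real uncompressed size is 15 bytes, but only the last block is reported.
        System.out.println(trailerSize(both)); // prints 10
    }
}
```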

A more robust version would process each gzip block in turn and sum the uncompressed sizes. Note that the GZIP header does not record the length of the compressed data, so you cannot simply seek to the end of a block. You would have to parse each GZIP block header, run the deflate data through an inflater to find where the block ends, read the 8-byte trailer (CRC32 followed by the uncompressed size), and then keep parsing any additional GZIP blocks until you reach EOF. That means reading the whole file, although the decompressed output never has to be stored.
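If exact sizes matter more than speed, one fallback that avoids hand-parsing is to stream each file through `java.util.zip.GZIPInputStream` and count the bytes while discarding them. This is a different technique from reading the trailer: it does decompress everything, but it never keeps the output, it follows concatenated blocks on current JDKs, and it is not limited to 4GB. The class name `GzipStreamingSize` is hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

class GzipStreamingSize {
    /** Counts decompressed bytes without keeping them. */
    static long uncompressedSize(InputStream rawGzip) throws IOException {
        long total = 0;
        byte[] scratch = new byte[64 * 1024]; // reused, contents are discarded
        try (GZIPInputStream in = new GZIPInputStream(rawGzip, scratch.length)) {
            int n;
            while ((n = in.read(scratch)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java GzipStreamingSize <file.gz>
        try (InputStream raw = new BufferedInputStream(
                Files.newInputStream(Paths.get(args[0])))) {
            System.out.println(uncompressedSize(raw));
        }
    }
}
```

A reasonable compromise for a large tree might be to use the trailer read for most files and fall back to this streaming count only for files that could exceed 4GB or contain several blocks.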

Community
dkatzel
  • that is ... actually interesting - once again : documentation / spec FTW! Thanks! – specializt Jan 07 '15 at 19:11
  • Wow. That gave me exactly the same value as the `gzip -l` command for the uncompressed size. Isn't it strange that there isn't an API (or at least an easy-to-find one) for those operations? Anyway, thank you very much for your answer. – Netto Jan 07 '15 at 19:26
  • @Netto no problem. I imagine `gzip -l` does what I describe in my last paragraph – dkatzel Jan 07 '15 at 19:27
  • If your files are larger than 2GB, then last call to getInt() will overflow and if your file is larger than 4GB, then I would think the zipped file will be in multiple blocks and you will have to do the more complicated parsing I describe at the end of my post – dkatzel Jan 07 '15 at 19:43
  • I imagined something like that when I saw the int value for the size in bytes. Do you have any relevant tips for identifying the gzip blocks inside the file? – Netto Jan 07 '15 at 20:56
  • A GZIP encoded file should be all GZIP blocks. The RFC explains the size of each field in the header and what they are for. You can also look at the source for `java.util.zip.GZIPInputStream` to see how it does it. That code reads most of the fields but you don't need to. Also be sure to check out `readTrailer()` which looks for additional blocks. This was added in Java 7 I think... – dkatzel Jan 07 '15 at 21:02
0

Look at Apache Commons Compress, it has support for gzip. It also has a class 'org.apache.commons.compress.compressors.gzip.GzipParameters' that might be of help.

MJSG
  • That was very interesting. I was able to access the `GzipParameters` instance of my GZIP file. Unfortunately, it seems to only hold the **parameters for the compressor**. The most useful method I found in this class was `GzipParameters.getCompressionLevel()`, but it returns `-1` for my file. Thanks anyway. – Netto Jan 07 '15 at 19:20