6

Is there a way to find out the size of the original file inside a GZIP file in Java?

For example, I have a 15 MB file a.txt that has been gzipped to a.gz, which is 3 MB. I want to know the size of the a.txt inside a.gz without unzipping a.gz.

manil
  • I doubt you can without unzipping it first. – Steven Mar 15 '12 at 06:44
  • The thing is, file sizes may go up to 30 GB, so unzipping it would take up a lot of space. – manil Mar 15 '12 at 06:49
  • If the file is under 2Gb, you can read the last few bytes of the file, and the uncompressed size is there - if more, not sure there is a reliable way (I've been looking for a way for a while too...) – Nim Mar 15 '12 at 07:03
  • Ah, I see... I didn't know about the 2 GB thing, so thank you for that :) – manil Mar 15 '12 at 07:05
  • What more detail do you want? My answer is all there is to it. – Mark Adler Jul 09 '12 at 13:49

4 Answers

27

There is no truly reliable way, other than gunzipping the stream. You do not need to save the result of the decompression, so you can determine the size by simply reading and decoding the entire file without taking up space with the decompressed result.

There is an unreliable way to determine the uncompressed size, which is to look at the last four bytes of the gzip file. They hold the uncompressed length of that entry modulo 2^32, in little-endian order.

It is unreliable because a) the uncompressed data may be longer than 2^32 bytes, and b) the gzip file may consist of multiple gzip streams, in which case you would find the length of only the last of those streams.
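
Both failure modes can be reproduced in a few lines. The following is a minimal sketch, not part of the original answer: the class name `GzipTrailerDemo` and its helper methods are made up for illustration, and it assumes a JDK whose `GZIPInputStream` reads through concatenated gzip streams. It builds a file out of two gzip streams in memory, then compares the last four bytes with the length obtained by actually decoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; the names are illustrative only.
public class GzipTrailerDemo {

    // Compress data into one complete gzip stream (header, deflate data, trailer).
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    // Read the last four bytes as an unsigned little-endian value.
    static long trailerLength(byte[] gz) {
        int n = gz.length;
        return (gz[n - 4] & 0xFFL)
             | (gz[n - 3] & 0xFFL) << 8
             | (gz[n - 2] & 0xFFL) << 16
             | (gz[n - 1] & 0xFFL) << 24;
    }

    // Decode everything and count the bytes without keeping them.
    static long decodedLength(byte[] gz) throws IOException {
        long total = 0;
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            byte[] buf = new byte[8192];
            int read;
            while ((read = in.read(buf)) != -1) {
                total += read;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] first = gzip("hello ".getBytes(StandardCharsets.US_ASCII));    // 6 bytes
        byte[] second = gzip("world!!".getBytes(StandardCharsets.US_ASCII));  // 7 bytes

        // Case (b): two gzip streams written back to back in one file.
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        joined.write(first);
        joined.write(second);
        byte[] file = joined.toByteArray();

        System.out.println("trailer says: " + trailerLength(file));
        System.out.println("decoded size: " + decodedLength(file));

        // Case (a): the stored length is the real length modulo 2^32.
        long fiveGiB = 5L << 30;
        System.out.println("5 GiB stored as: " + (fiveGiB & 0xFFFFFFFFL));
    }
}
```

Here the trailer reports 7, the length of the last stream only, while decoding yields 13. The final line shows that a 5 GiB payload would be recorded as 1073741824, that is, 1 GiB.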

If you are in control of the source of the gzip files, you know that they consist of single gzip streams, and you know that they are less than 2^32 bytes uncompressed, then and only then can you use those last four bytes with confidence.

pigz (which can be found at http://zlib.net/pigz/ ) can do it both ways. pigz -l will give you the unreliable length very quickly. pigz -lt will decode the entire input and give you the reliable lengths.

Mark Adler
  • Thanks to You I've just discovered that `gzip -l bigfile.gz` actually uses an unreliable way too O_o (and thus reports wrong compression ratio on large files as well). – Vlad Oct 25 '13 at 18:21
  • For single files encoded once, of a certain size (under 2^32 bytes), the RFC gives you the way to pull the last 4-bytes to get the file size. Perhaps not completely general, but still very useful. – ChuckCottrill Sep 25 '14 at 01:59
5

Below is one approach to this problem. It is certainly not the best approach, but since Java doesn't provide an API method for this (unlike for Zip files), it's the only way I could think of, apart from the one in the comments above, which reads the last 4 bytes (assuming the file is under 2 GB in size).

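import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

long size = 0;
// try-with-resources closes the stream even if reading fails
try (GZIPInputStream zis = new GZIPInputStream(new FileInputStream("myFile.gz"))) {
    byte[] buf = new byte[8192];  // reused for every read; its contents are discarded
    int read;
    // read() returns -1 at end of stream; available() is not a reliable end-of-data test
    while ((read = zis.read(buf)) != -1) {
        size += read;
    }
}

System.out.println("File Size: " + size + " bytes");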

As you can see, the gzip file is read in, and the number of bytes read is totalled, which gives the uncompressed file size.

While this method does work, I cannot recommend it for very large files unless time is not much of a constraint, because the whole file has to be read and inflated, which may take several seconds or more.

Crollster
2
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ReadStream {

    public static void main(String[] args) {
        // try-with-resources closes the file even if a read fails
        try (RandomAccessFile raf = new RandomAccessFile("D:/temp/temp.gz", "r")) {
            // The last four bytes hold the uncompressed length modulo 2^32, little endian
            raf.seek(raf.length() - 4);

            long b4 = raf.read();
            long b3 = raf.read();
            long b2 = raf.read();
            long b1 = raf.read();
            // Combine as a long so that values of 2^31 and above do not come out negative
            long val = (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;

            System.out.println(val);
        } catch (IOException ex) {
            // Also covers FileNotFoundException, which is a subclass of IOException
            Logger.getLogger(ReadStream.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
Crazenezz
  • As noted in my answer, this is an unreliable way to get the uncompressed length. It's fine for printing out for human consumption, but if a program is depending on getting the correct length, then this method is not guaranteed to work. – Mark Adler Jul 09 '12 at 19:44
  • @MarkAdler Thanks for the information. I just googled, found this approach, tried it, and it worked. My bad, I didn't know about the drawback. – Crazenezz Jul 10 '12 at 00:41
  • @MarkAdler: My bad, I didn't understand the point in your answer above. I get it now. – Crazenezz Jul 10 '12 at 08:05
0

GZIP doesn't let you know about the size of the contents in advance. These are the ways of managing it that I can think of depending on your requirements:

  1. unzip the stream on the fly and abort if it is too large
  2. unzip the stream but without writing out the content. This will get the size of the uncompressed data without taking up any space. It only costs the processing to read and inflate
  3. switch to using zip files (which have entries that can tell you the length in advance)
  4. if you know the type of data you are typically receiving, you may be able to statistically estimate the size based on the size of the compressed gzip.
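
The decode-and-count options in the list above (aborting when the data is too large, or counting without writing anything out) might be combined as in the following sketch. The class name `GzipSizeLimit`, the method `sizeUpTo`, and the convention of returning -1 once the limit is exceeded are assumptions made for this illustration, not an existing API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; the names and the -1 convention are illustrative only.
public class GzipSizeLimit {

    // Returns the uncompressed size, or -1 as soon as it exceeds the limit.
    static long sizeUpTo(InputStream compressed, long limit) throws IOException {
        long total = 0;
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            byte[] buf = new byte[8192];  // inflated data is discarded after counting
            int read;
            while ((read = in.read(buf)) != -1) {
                total += read;
                if (total > limit) {
                    return -1;  // stop inflating; the rest of the input is never read
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Build a small gzip stream in memory from 100,000 zero bytes.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(new byte[100_000]);
        }
        byte[] file = out.toByteArray();

        System.out.println(sizeUpTo(new ByteArrayInputStream(file), 1_000_000)); // 100000
        System.out.println(sizeUpTo(new ByteArrayInputStream(file), 50_000));    // -1
    }
}
```

For a real file, a `FileInputStream` would take the place of the in-memory stream. Memory use stays at the size of the buffer regardless of how large the uncompressed data is.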
Paul Jowett