The gzip file format contains the (uncompressed/original) file size encoded in the last 4 bytes of the compressed file. The "gzip -l" command reports the compressed and uncompressed sizes, the compression ratio, and the original filename.

Looking around Stack Overflow, I found a couple of mentions of decoding the size encoded in the last 4 bytes.

What is the encoding of the size? Is it big-endian (most significant byte first) or little-endian (least significant byte first), and is the value signed or unsigned?

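This code snippet seems to work for me:

#include <stdio.h>
#include <sys/stat.h>

FILE* fh; // assume the file is already open in binary mode ("rb")
unsigned char szbuf[4];
struct stat statbuf;
fstat(fileno(fh), &statbuf); // fstat takes a file descriptor, not a FILE*
unsigned long clen = statbuf.st_size; // length of the compressed file
fseek(fh, (long)clen - 4, SEEK_SET); // position at the last 4 bytes
size_t count = fread(szbuf, 1, 4, fh); // count should be 4 on success
// szbuf[0] is the least significant byte; cast first so the shifts happen on an unsigned type
unsigned long ulen = ((((((unsigned long)szbuf[3] << 8) | szbuf[2]) << 8) | szbuf[1]) << 8) | szbuf[0];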

Here are a few related posts, which seem to imply that the value is little-endian and an unsigned 32-bit integer (0 to 4 GiB - 1):

Determine uncompressed size of GZIP file

GZIPOutputStream not updating Gzip size bytes

Determine size of file in gzip

Gzip.org has more information about gzip.

ChuckCottrill
  • See [this answer](http://stackoverflow.com/a/9727599/1180620) for why that length should in general not be relied upon. – Mark Adler Sep 25 '14 at 00:51
  • Agreed. For single files encoded once, of a certain size (under 2^32 bytes), the RFC gives you the way to pull the last 4 bytes to get the file size. Perhaps not completely general, but still very useful. – ChuckCottrill Sep 25 '14 at 01:59
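
To illustrate the size limit mentioned in these comments, here is a minimal sketch of what the trailer can hold for a given original size, assuming the "modulo 2^32" definition from RFC 1952. The function name gzip_expected_isize is hypothetical and not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper: the value the 4-byte size field would hold for an
   input of original_size bytes, i.e. the size reduced modulo 2^32. */
static uint32_t gzip_expected_isize(uint64_t original_size)
{
    /* Keep only the low 32 bits; anything above them is lost. */
    return (uint32_t)(original_size & 0xFFFFFFFFu);
}
```

For example, a 5 GiB input would be recorded as 1 GiB, and an input of exactly 2^32 bytes would be recorded as 0, so the field alone cannot distinguish those cases from genuinely smaller files.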

1 Answer

The RFC says the stored size is the original size modulo 2^32, which means it fits a uint32_t, and experimentation with a .NET GZipStream shows that it is stored little-endian.

RFC 1952
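
As a rough sketch of how that could look in C, the following decodes the last 4 bytes as an unsigned 32-bit little-endian value. The names gzip_decode_isize and gzip_read_isize are hypothetical, and the sketch assumes the stream is opened in binary mode and supports seeking relative to SEEK_END (true on POSIX systems, not guaranteed by the C standard for binary streams).

```c
#include <stdint.h>
#include <stdio.h>

/* Decode 4 bytes stored least-significant byte first into a uint32_t.
   Each byte is widened before shifting so no shift touches a signed int. */
static uint32_t gzip_decode_isize(const unsigned char b[4])
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Hypothetical helper: read the trailing 4 bytes of an open binary stream
   and decode them. Returns 0 on success, -1 on a seek or read failure. */
static int gzip_read_isize(FILE *fp, uint32_t *out)
{
    unsigned char tail[4];

    if (fseek(fp, -4L, SEEK_END) != 0)
        return -1;
    if (fread(tail, 1, sizeof tail, fp) != sizeof tail)
        return -1;

    *out = gzip_decode_isize(tail);
    return 0;
}
```

Using uint32_t rather than unsigned long keeps the result at exactly 32 bits on every platform, which matches the modulo 2^32 definition. The sketch does not check that the file is actually a gzip file.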

Medinoc
  • I added the RFC link. – Medinoc Sep 24 '14 at 22:24
  • Your experimental results are confirmed in section 2.1 of RFC 1952: "All multi-byte numbers in the format described here are stored with the least-significant byte first (at the lower memory address)." – indiv Sep 24 '14 at 22:27
  • As you have pointed out, the RFC (section 2.1) specifies byte order as least significant byte to most significant byte. Thus the ISIZE 4-byte file size is stored little-endian (as experimental results have confirmed). – ChuckCottrill Sep 25 '14 at 00:33