Why do gzip of Java and Go get different results?

Question

Firstly, my Java version:

String str = "helloworld";
ByteArrayOutputStream localByteArrayOutputStream = new ByteArrayOutputStream(str.length());
GZIPOutputStream localGZIPOutputStream = new GZIPOutputStream(localByteArrayOutputStream);
localGZIPOutputStream.write(str.getBytes("UTF-8"));
localGZIPOutputStream.close();
localByteArrayOutputStream.close();
for(int i = 0;i < localByteArrayOutputStream.toByteArray().length;i ++){
    System.out.println(localByteArrayOutputStream.toByteArray()[i]);
}

and output is:

31 -117 8 0 0 0 0 0 0 0 -53 72 -51 -55 -55 47 -49 47 -54 73 1 0 -83 32 -21 -7 10 0 0 0

Then the Go version:

var gzBf bytes.Buffer
gzSizeBf := bufio.NewWriterSize(&gzBf, len(str))
gz := gzip.NewWriter(gzSizeBf)
gz.Write([]byte(str))
gz.Flush()
gz.Close()
gzSizeBf.Flush()
GB := (&gzBf).Bytes()
for i := 0; i < len(GB); i++ {
    fmt.Println(GB[i])
}

output:

31 139 8 0 0 9 110 136 0 255 202 72 205 201 201 47 207 47 202 73 1 0 0 0 255 255 1 0 0 255 255 173 32 235 249 10 0 0 0

Why?

At first I thought it might be caused by the different ways the two languages represent bytes. But I noticed that a 0 can never turn into a 9, and the lengths of the two byte arrays are different.

Is my code wrong? Is there any way to make my Go program produce the same output as the Java program?

Thanks!

Compression is not often considered a deterministic operation, while (lossless) compression+decompression should of course always return the same data. The problem here could be as simple as the default compression level chosen by the different implementations. — Jonathon Reinhart, Mar 12 '15 at 06:00
Does it matter that Java and Go produce different outputs? Because Gzip is lossless, what matters is that the process is always reversible: decompression should give the original data without *any* error or discrepancy. This should not depend on whether the compressor was written in any particular language. — Rick-777, Mar 12 '15 at 09:20

icza · Answer 1 · 2015-03-12T16:13:07.143

First, the byte type in Java is signed and has a range of -128..127, while in Go byte is an alias of uint8 and has a range of 0..255. So to compare the results, you have to add 256 to the negative Java values.

Tip: To display a Java byte value in an unsigned fashion, use byteValue & 0xff, which converts it to an int using the 8 bits of the byte as the lowest 8 bits of the int. Or better: display both results in hex form so you don't have to care about signedness.
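The same conversion can be sketched from the Go side. This is only an illustration: toUnsigned is a hypothetical helper name, and the input is the first four values of the Java output from the question. It reinterprets the signed values as unsigned bytes so the two outputs can be compared directly, in decimal or hex.

```go
package main

import "fmt"

// toUnsigned reinterprets Java-style signed bytes (-128..127) as
// Go-style unsigned bytes (0..255). The 8 bits stay the same;
// negative values simply show up as value+256.
func toUnsigned(signed []int8) []byte {
	out := make([]byte, len(signed))
	for i, v := range signed {
		out[i] = byte(v)
	}
	return out
}

func main() {
	// First four bytes of the Java output from the question.
	javaBytes := []int8{31, -117, 8, 0}

	fmt.Println(toUnsigned(javaBytes))         // [31 139 8 0]
	fmt.Printf("% x\n", toUnsigned(javaBytes)) // 1f 8b 08 00
}
```

After this conversion the first four bytes match the Go output exactly (31 139 8 0), so the remaining differences are not a signedness issue.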

Even if you do the shift, you will still see different results. That might be due to different default compression levels in the two languages. Note that although the default compression level is 6 in both Java and Go, this is not specified, so different implementations are allowed to choose different values, and it might also change in future releases.

And even if the compression level were the same, you might still encounter differences. Gzip is based on LZ77 and Huffman coding, which uses a tree built on symbol frequency (probability) to decide the output codes. If different input characters or bit patterns have the same frequency, the codes assigned to them may vary between implementations. Moreover, several candidate output bit patterns may have the same length, so a different one may be chosen.

If you want the same output, the only way would be (see notes below!) to use the 0 compression level (not to compress at all). In Go use the compression level gzip.NoCompression and in Java use Deflater.NO_COMPRESSION.

Java:

GZIPOutputStream gzip = new GZIPOutputStream(localByteArrayOutputStream) {
    {
        def.setLevel(Deflater.NO_COMPRESSION);
    }
};

Go:

gz, err := gzip.NewWriterLevel(gzSizeBf, gzip.NoCompression)

But I wouldn't worry about the different outputs. Gzip is a standard: even if the outputs are not the same, you will still be able to decompress them with any gzip decoder, regardless of which encoder was used to compress the data, and the decoded data will be exactly the same.
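To illustrate that point, here is a minimal sketch that feeds both byte sequences from the question to Go's gzip reader. The helper name gunzip and the variable names are made up for this example, and the Java values are written in their unsigned form.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// Byte sequences copied from the question (Java values shown unsigned).
var javaOut = []byte{31, 139, 8, 0, 0, 0, 0, 0, 0, 0, 203, 72, 205, 201, 201,
	47, 207, 47, 202, 73, 1, 0, 173, 32, 235, 249, 10, 0, 0, 0}

var goOut = []byte{31, 139, 8, 0, 0, 9, 110, 136, 0, 255, 202, 72, 205, 201, 201,
	47, 207, 47, 202, 73, 1, 0, 0, 0, 255, 255, 1, 0, 0, 255, 255,
	173, 32, 235, 249, 10, 0, 0, 0}

// gunzip decompresses a complete in-memory gzip stream.
// The reader also verifies the CRC-32 and length stored in the trailer.
func gunzip(data []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	plain, err := io.ReadAll(zr)
	return string(plain), err
}

func main() {
	fromJava, err := gunzip(javaOut)
	if err != nil {
		panic(err)
	}
	fromGo, err := gunzip(goOut)
	if err != nil {
		panic(err)
	}
	fmt.Println(fromJava, fromGo, fromJava == fromGo)
}
```

Both streams decode to helloworld even though they differ in length (30 versus 39 bytes). Note that the last eight bytes, the CRC-32 and the uncompressed size, are identical in both.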

Here are the simplified, extended versions:

Not that it matters, but your code is unnecessarily complex. You could simplify it like this (these versions also set compression level 0 and convert the negative Java byte values):

Java version:

ByteArrayOutputStream buf = new ByteArrayOutputStream();
GZIPOutputStream gz = new GZIPOutputStream(buf) {
    { def.setLevel(Deflater.NO_COMPRESSION); }
};
gz.write("helloworld".getBytes("UTF-8"));
gz.close();
for (byte b : buf.toByteArray())
    System.out.print((b & 0xff) + " ");

Go version:

var buf bytes.Buffer
gz, _ := gzip.NewWriterLevel(&buf, gzip.NoCompression)
gz.Write([]byte("helloworld"))
gz.Close()
fmt.Println(buf.Bytes())

NOTES:

The gzip format allows some extra fields (headers) to be included in the output.

In Go these are represented by the gzip.Header type:

type Header struct {
    Comment string    // comment
    Extra   []byte    // "extra data"
    ModTime time.Time // modification time
    Name    string    // file name
    OS      byte      // operating system type
}

And it is accessible via the Writer.Header struct field. Go sets and inserts them, while Java does not (leaves header fields zero). So even if you set compression level to 0 in both languages, the output will not be the same (but the "compressed" data will match in both outputs).
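As a rough sketch of how these fields travel with the stream, the following example sets two of them on a writer and reads them back with a reader. The helper names writeWithHeader and readHeader, as well as the file name and comment values, are invented for illustration.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// writeWithHeader gzips payload after setting two optional header fields.
// gzip.Writer embeds gzip.Header, so the fields can be assigned directly,
// but this must happen before the first Write, Flush or Close.
func writeWithHeader(payload []byte, name, comment string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Name = name
	zw.Comment = comment
	if _, err := zw.Write(payload); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readHeader returns the header fields stored in a gzip stream.
func readHeader(data []byte) (gzip.Header, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return gzip.Header{}, err
	}
	defer zr.Close()
	return zr.Header, nil
}

func main() {
	data, err := writeWithHeader([]byte("helloworld"), "hello.txt", "demo")
	if err != nil {
		panic(err)
	}
	hdr, err := readHeader(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("name=%q comment=%q os=%d\n", hdr.Name, hdr.Comment, hdr.OS)
	fmt.Printf("FLG byte: %#02x\n", data[3]) // set bits mark which optional fields follow
}
```

Because the optional fields are stored in the stream before the compressed data, two writers that fill them differently produce different bytes even for identical input.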

Unfortunately the standard Java does not provide a way/interface to set/add these fields, and Go does not make it optional to fill the Header fields in the output, so you will not be able to generate exact outputs.

Another option is to use a third-party gzip library for Java that supports setting these fields. Apache Commons Compress is one example: it contains a GzipCompressorOutputStream class with a constructor that accepts a GzipParameters instance. GzipParameters is the equivalent of Go's gzip.Header structure. Only by using this would you be able to generate exactly matching output.

But as mentioned, generating exact output has no real-life value.

score 9 · Accepted Answer · edited Oct 07 '21 at 08:14

From RFC 1952, the GZip file header is structured as:

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+

Looking at the output you've provided, we have:

                          |    Java |          Go
ID1                       |      31 |          31
ID2                       |     139 |         139
CM (compression method)   |       8 |           8
FLG (flags)               |       0 |           0
MTIME (modification time) | 0 0 0 0 | 0 9 110 136
XFL (extra flags)         |       0 |           0
OS (operating system)     |       0 |         255

So we can see that Go is setting the modification time field of the header, and setting the operating system to 255 (unknown) rather than 0 (FAT file system). In other respects they indicate that the file is compressed in the same way.
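The comparison in the table can be reproduced with a small parser for the ten fixed header bytes. This is a sketch only: fixedHeader and parseFixedHeader are names made up for this example, and the optional fields that may follow (signalled by FLG) are not handled.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// fixedHeader holds the ten mandatory header bytes of a gzip member
// as laid out in RFC 1952.
type fixedHeader struct {
	ID1, ID2 byte
	CM, FLG  byte
	MTIME    uint32 // stored little-endian in the stream
	XFL, OS  byte
}

// parseFixedHeader decodes the first ten bytes of a gzip stream.
func parseFixedHeader(b []byte) (fixedHeader, error) {
	if len(b) < 10 {
		return fixedHeader{}, errors.New("gzip header needs at least 10 bytes")
	}
	if b[0] != 0x1f || b[1] != 0x8b {
		return fixedHeader{}, errors.New("not a gzip stream")
	}
	return fixedHeader{
		ID1: b[0], ID2: b[1],
		CM: b[2], FLG: b[3],
		MTIME: binary.LittleEndian.Uint32(b[4:8]),
		XFL:   b[8], OS: b[9],
	}, nil
}

func main() {
	// First ten bytes of each output from the question.
	javaHdr := []byte{31, 139, 8, 0, 0, 0, 0, 0, 0, 0}
	goHdr := []byte{31, 139, 8, 0, 0, 9, 110, 136, 0, 255}

	for _, raw := range [][]byte{javaHdr, goHdr} {
		h, err := parseFixedHeader(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%+v\n", h)
	}
}
```

For the two headers in the question, only the MTIME and OS fields differ; everything else in the fixed header is identical.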

In general these sorts of differences are harmless. If you want to determine if two compressed files are the same, then you should really compare the decompressed versions of the files though.
