I am trying to compute the sha256 sum of a gzipped file in Go, but my output does not match that of the gzip command.

I have a function Compress that gzips the contents of an io.Reader, a file in my case.

func Compress(r io.Reader) (io.Reader, error) {
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    if _, err := io.Copy(zw, r); err != nil {
        return nil, err
    }
    if err := zw.Close(); err != nil {
        return nil, err
    }
    return &buf, nil
}

Then I have a function Sum256 that computes the sha256 sum of a reader.

func Sum256(r io.Reader) ([]byte, error) {
    h := sha256.New()
    if _, err := io.Copy(h, r); err != nil {
        return nil, err
    }
    return h.Sum(nil), nil
}

My main function opens a file, gzips it, then computes the sha256 sum of the gzipped contents. The problem is that the output does not match that of the gzip command. The input file hello.txt contains only the word hello, with no trailing newline.

func main() {
    uncompressed, err := os.Open("hello.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer uncompressed.Close()

    sum, err := Sum256(uncompressed)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%x  %s\n", sum, uncompressed.Name())

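    // Rewind to the start of the file before reading it a second time.
    if _, err := uncompressed.Seek(0, io.SeekStart); err != nil {
        log.Fatal(err)
    }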
    compressed, err := Compress(uncompressed)
    if err != nil {
        log.Fatal(err)
    }

    sum, err = Sum256(compressed)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%x  %s.gz\n", sum, uncompressed.Name())
}

gzip results:

$ sha256sum hello.txt
2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824  hello.txt

$ gzip -c hello.txt | sha256sum
809d7f11e97291d06189e82ca09a1a0a4a66a3c85a24ac7ff389ae6fbe02bcce  -

$ gzip -nc hello.txt | sha256sum
f901eda57fd86d4239806fd4b76f64036c1c20711267a7bc776ab2aa45069b2a  -

My program results:

$ go run main.go
# match
2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824  hello.txt
# mismatch
3429ae8bc6346f1e4fb67b7d788f85f4637e726a725cf4b66c521903d0ab3b07  hello.txt.gz

Any idea why the outputs don't match, or how to fix this? I have tried using an io.Pipe, a temporary file from ioutil.TempFile, and other methods, all with the same result.

Jonathan Hall
mwalto7
  • 5
    Gzip is not a unique/single/deterministic compression. Different implementations may produce different compressed output, and it is perfectly fine for them to do so for the same input. – Volker Dec 14 '18 at 21:03
  • 5
    Gzip is not required to produce a particular output for a particular input. It only has to produce an output that is compatible with gzip in order to be decompressed. – Adrian Dec 14 '18 at 21:03
  • 1
    Possible duplicate of [Compressed output differs from Go to Ruby Implementation](https://stackoverflow.com/questions/52767214/compressed-output-differs-from-go-to-ruby-implementation) – Steffen Ullrich Dec 15 '18 at 03:45

1 Answer

In particular, note that if you run the command:

gzip -c hello.txt

The output will contain the filename, hello.txt, in the gzip header, along with the file's modification time. You can see this with hexdump:

$ touch hello.txt; gzip -c hello.txt | hexdump -C
00000000  1f 8b 08 08 ad 1b 14 5c  00 03 68 65 6c 6c 6f 2e  |.......\..hello.|
00000010  74 78 74 00 03 00 00 00  00 00 00 00 00 00        |txt...........|
0000001e

If you just copy data into a gzip stream in your program, the filename won't be there. The compressed bytes therefore differ, and so does the SHA-256 sum.

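However, even if you fix this particular difference, you are still not guaranteed to get the same bytes from two gzip implementations run on the same file. The format only defines how a stream is decompressed, so each compressor is free to encode the data differently.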

If you want the checksum to be the same, run the checksum on the decompressed data instead.

Dietrich Epp