
I have a large int array that I want to persist on the filesystem. My understanding is that the best way to store something like this is to use the gob package to convert it to a byte array and then compress that with gzip. When I need it again, I reverse the process. I am fairly sure I am storing it correctly, but recovering it fails with EOF. The example code below demonstrates the issue (playground link: https://play.golang.org/p/v4rGGeVkLNh). I am not convinced gob is needed; from what I have read, it's more efficient to store the data as a byte array than as an int array, but that may not be true. Thanks!

package main

import (
    "bufio"
    "bytes"
    "compress/gzip"
    "encoding/gob"
    "fmt"
)

func main() {
    arry := []int{1, 2, 3, 4, 5}
    //now gob this
    var indexBuffer bytes.Buffer
    writer := bufio.NewWriter(&indexBuffer)
    encoder := gob.NewEncoder(writer)
    if err := encoder.Encode(arry); err != nil {
        panic(err)
    }
    //now compress it
    var compressionBuffer bytes.Buffer
    compressor := gzip.NewWriter(&compressionBuffer)
    compressor.Write(indexBuffer.Bytes())
    defer compressor.Close()
    //<--- I think all is good until here

    //now decompress it
    buf := bytes.NewBuffer(compressionBuffer.Bytes())
    fmt.Println("byte array before unzipping: ", buf.Bytes())
    if reader, err := gzip.NewReader(buf); err != nil {
        fmt.Println("gzip failed ", err)
        panic(err)
    } else {
        //now ungob it...
        var intArray []int
        decoder := gob.NewDecoder(reader)
        defer reader.Close()
        if err := decoder.Decode(&intArray); err != nil {
            fmt.Println("gob failed ", err)
            panic(err)
        }
        fmt.Println("final int Array content: ", intArray)
    }
}
Jonathan Hall
amlwwalker
  • What is your question, then? Just whether `gob` is appropriate or more efficient? Or do you have some specific problem with your code? – Jonathan Hall Jul 09 '19 at 10:47
  • When I try to recover the initial int array after Gob->compress->expand->Gob I get an EOF - i.e its not restoring to the original int array – amlwwalker Jul 09 '19 at 10:50

1 Answer


You are using bufio.Writer which, as its name implies, buffers bytes written to it. This means you have to flush it to make sure the buffered data makes its way to the underlying writer:

writer := bufio.NewWriter(&indexBuffer)
encoder := gob.NewEncoder(writer)
if err := encoder.Encode(arry); err != nil {
    panic(err)
}
if err := writer.Flush(); err != nil {
    panic(err)
}
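
To see the buffering in isolation, here is a minimal sketch that measures how many bytes have reached the underlying bytes.Buffer before and after Flush. The helper name bufferedLengths is hypothetical and exists only for this demonstration; it assumes the default bufio.Writer buffer size, which is much larger than the gob output for five ints.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/gob"
	"fmt"
)

// bufferedLengths (hypothetical helper) gob-encodes values through a
// bufio.Writer and reports how many bytes the underlying bytes.Buffer
// holds before and after Flush.
func bufferedLengths(values []int) (before, after int, err error) {
	var indexBuffer bytes.Buffer
	writer := bufio.NewWriter(&indexBuffer)
	if err := gob.NewEncoder(writer).Encode(values); err != nil {
		return 0, 0, err
	}
	// The small payload is still held inside the bufio.Writer here.
	before = indexBuffer.Len()
	if err := writer.Flush(); err != nil {
		return 0, 0, err
	}
	// After Flush the encoded bytes are in indexBuffer.
	after = indexBuffer.Len()
	return before, after, nil
}

func main() {
	before, after, err := bufferedLengths([]int{1, 2, 3, 4, 5})
	if err != nil {
		panic(err)
	}
	fmt.Println("bytes before Flush:", before)
	fmt.Println("bytes after Flush: ", after)
}
```

In the question's code the flush never happens, so indexBuffer.Bytes() is empty when it is handed to the gzip writer.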

That said, bufio.Writer is completely unnecessary here because you're already writing to an in-memory buffer (bytes.Buffer). Skip it and write directly to the bytes.Buffer, and then there is nothing to flush:

var indexBuffer bytes.Buffer
encoder := gob.NewEncoder(&indexBuffer)
if err := encoder.Encode(arry); err != nil {
    panic(err)
}

The next error is how you close the gzip stream:

defer compressor.Close()

This deferred call runs only when the enclosing function (the main() function) returns, not a moment earlier. By then you have already tried to read the compressed data, but some of it may still sit in an internal buffer of gzip.Writer rather than in compressionBuffer, and the gzip footer has not been written yet, so you can't read the complete compressed stream from compressionBuffer. Close the gzip stream without using defer:

if err := compressor.Close(); err != nil {
    panic(err)
}
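
As a rough illustration of why the explicit Close matters, the sketch below compares the size of compressionBuffer just before and just after Close. The helper sizesAroundClose is a hypothetical name used only here, and the exact byte counts depend on the payload, so none are claimed.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// sizesAroundClose (hypothetical helper) writes payload to a gzip.Writer
// and reports how many bytes the destination buffer holds before and
// after the writer is closed.
func sizesAroundClose(payload []byte) (before, after int, err error) {
	var compressionBuffer bytes.Buffer
	compressor := gzip.NewWriter(&compressionBuffer)
	if _, err := compressor.Write(payload); err != nil {
		return 0, 0, err
	}
	// Compressed data may still be pending inside the gzip.Writer here.
	before = compressionBuffer.Len()
	// Close flushes pending data and writes the gzip footer.
	if err := compressor.Close(); err != nil {
		return 0, 0, err
	}
	after = compressionBuffer.Len()
	return before, after, nil
}

func main() {
	before, after, err := sizesAroundClose([]byte("some example payload"))
	if err != nil {
		panic(err)
	}
	fmt.Println("bytes before Close:", before)
	fmt.Println("bytes after Close: ", after)
}
```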

With these changes, your program runs and outputs (try it on the Go Playground):

byte array before unzipping:  [31 139 8 0 0 0 0 0 0 255 226 249 223 200 196 200 244 191 137 129 145 133 129 129 243 127 19 3 43 19 11 27 7 23 32 0 0 255 255 110 125 126 12 23 0 0 0]
final int Array content:  [1 2 3 4 5]

As a side note: the buf created by buf := bytes.NewBuffer(compressionBuffer.Bytes()) is also completely unnecessary. You can decode from compressionBuffer itself, because data previously written to a bytes.Buffer can be read back from it.
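
Putting the fixes together, a corrected round trip might look like the sketch below. The function names compressInts and decompressInts are hypothetical, and the sketch goes one small step beyond the snippets above by letting gob write straight into the gzip writer instead of going through a separate indexBuffer.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/gob"
	"fmt"
	"io"
)

// compressInts (hypothetical name) gob-encodes values directly into a
// gzip stream and returns the buffer holding the compressed bytes.
func compressInts(values []int) (*bytes.Buffer, error) {
	var compressionBuffer bytes.Buffer
	compressor := gzip.NewWriter(&compressionBuffer)
	if err := gob.NewEncoder(compressor).Encode(values); err != nil {
		return nil, err
	}
	// Close explicitly (no defer) so everything is written before reading.
	if err := compressor.Close(); err != nil {
		return nil, err
	}
	return &compressionBuffer, nil
}

// decompressInts (hypothetical name) reverses compressInts.
func decompressInts(r io.Reader) ([]int, error) {
	reader, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer reader.Close()
	var values []int
	if err := gob.NewDecoder(reader).Decode(&values); err != nil {
		return nil, err
	}
	return values, nil
}

func main() {
	buf, err := compressInts([]int{1, 2, 3, 4, 5})
	if err != nil {
		panic(err)
	}
	// Read straight from the buffer that was written to.
	values, err := decompressInts(buf)
	if err != nil {
		panic(err)
	}
	fmt.Println("final int Array content: ", values)
}
```

Deferring reader.Close() is fine on the read side, since nothing needs to be flushed before the decoded values are used.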

As you might have noticed, the compressed data is much larger than the initial, uncompressed data. There are several reasons: both encoding/gob and compress/gzip streams have significant overhead, and they (may) only make the input smaller at a larger scale (5 int numbers don't qualify).

Please check the related question: Efficient Go serialization of struct to disk

For small arrays, you may also consider variable-length encoding, see binary.PutVarint().
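
A minimal sketch of that idea, assuming int64 values and hypothetical helper names encodeVarints and decodeVarints: each value is appended as a signed varint, and decoding simply reads varints until the input is exhausted.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeVarints (hypothetical helper) appends each value as a signed
// varint, so small magnitudes take fewer bytes than a fixed 8.
func encodeVarints(values []int64) []byte {
	out := make([]byte, 0, len(values))
	tmp := make([]byte, binary.MaxVarintLen64)
	for _, v := range values {
		n := binary.PutVarint(tmp, v)
		out = append(out, tmp[:n]...)
	}
	return out
}

// decodeVarints (hypothetical helper) reads varints until data is used up.
func decodeVarints(data []byte) ([]int64, error) {
	var values []int64
	for len(data) > 0 {
		v, n := binary.Varint(data)
		if n <= 0 {
			return nil, errors.New("truncated or invalid varint")
		}
		values = append(values, v)
		data = data[n:]
	}
	return values, nil
}

func main() {
	encoded := encodeVarints([]int64{1, 2, 3, 4, 5})
	decoded, err := decodeVarints(encoded)
	if err != nil {
		panic(err)
	}
	fmt.Println("encoded bytes:", len(encoded))
	fmt.Println("decoded:", decoded)
}
```

Because the decoder consumes the whole input, this layout only works when the varints are the sole content of the byte slice; anything else in the same stream would need a length prefix or a delimiter.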

icza
  • Excellent, thanks a lot. Reading about flush now, but will remove the writer from the process. Edit: Is there a way to know when compression becomes valuable? And is using gob necessary to save a []int that is not worth compressing? – amlwwalker Jul 09 '19 at 10:59
  • @amlwwalker When it becomes "valuable": measure. It depends on the input. Some input may be compressed better. The `gob` is not necessary, actually it just adds both computational and spatial overhead. If you still want to compress, just use `encoding/binary` to convert integers to bytes (which you can write to the `gzip` stream). Also note that you should not use `int` as its size depends on the architecture, but instead fixed-size integers like `int32` or `int64`. – icza Jul 09 '19 at 11:16
  • Cool, so getting rid of gob then sounds like a good idea. I think I'm pretty close; it's using encoding/binary to convert it back to the []int64 after compression that is failing for me. I get an empty array after reading into the []int64. Any chance you can take a look? https://play.golang.org/p/NRimlw4Udss – amlwwalker Jul 09 '19 at 11:38
  • Hmm, the issue may be that I need to know the size of the array first? gob can handle that, but encoding/binary can't? EDIT: yes, this seems to be it. Does that mean I need to persist the length of the original []int64, or can it be "worked out" on the fly? https://play.golang.org/p/cayXw7fCcK2 – amlwwalker Jul 09 '19 at 11:45
  • @amlwwalker Length of a value of type `[]int64` is not fixed, so you have to persist the length yourself. `gob` takes care of that for you. – icza Jul 09 '19 at 11:50
  • That's the ticket then. Save overhead by not using gob, and just persist the length. Thanks for all your help – amlwwalker Jul 09 '19 at 11:51
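
The approach the comments settle on (drop gob, use encoding/binary, and persist the length yourself) could be sketched as follows. The names writeInt64s and readInt64s, the uint64 length prefix, and the little-endian byte order are all assumptions made for illustration, not something the thread specifies.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
)

// writeInt64s (hypothetical helper) gzips a uint64 element count followed
// by the values themselves, all in little-endian order.
func writeInt64s(values []int64) ([]byte, error) {
	var compressed bytes.Buffer
	zw := gzip.NewWriter(&compressed)
	if err := binary.Write(zw, binary.LittleEndian, uint64(len(values))); err != nil {
		return nil, err
	}
	if err := binary.Write(zw, binary.LittleEndian, values); err != nil {
		return nil, err
	}
	// Close explicitly so all compressed data reaches the buffer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return compressed.Bytes(), nil
}

// readInt64s (hypothetical helper) reads the count first, then sizes the
// slice so binary.Read knows how many values to fill in.
func readInt64s(data []byte) ([]int64, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	var count uint64
	if err := binary.Read(zr, binary.LittleEndian, &count); err != nil {
		return nil, err
	}
	// The count is trusted here; validate it before allocating if the
	// input can come from an untrusted source.
	values := make([]int64, count)
	if err := binary.Read(zr, binary.LittleEndian, values); err != nil {
		return nil, err
	}
	return values, nil
}

func main() {
	data, err := writeInt64s([]int64{1, 2, 3, 4, 5})
	if err != nil {
		panic(err)
	}
	values, err := readInt64s(data)
	if err != nil {
		panic(err)
	}
	fmt.Println("restored:", values)
}
```

Reading into a nil or zero-length []int64 is what produces the empty result described in the comments: binary.Read fills exactly as many elements as the slice already has, so the slice must be allocated to the stored length first.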