
I'm currently working on an application in Go that needs to cache different resources. Different types of resources have handlers that know which data is relevant for deciding whether a resource has to be rebuilt or whether it can be fetched from cache. For this purpose, handlers shall generate a hash of all data relevant for caching. Depending on the context, that data can be primitives (int, float, ...), strings, slices, structs, and maps. So almost everything. The number of objects used for hashing may also vary.

To calculate that hash in handlers, I've created a hashing function with variadic parameters of type interface{}.

My current approach is this:

import (
    "crypto"
    _ "crypto/md5"
    "encoding/gob"
)

func Hash(objs ...interface{}) []byte {
    // Use MD5 because it's fast and reasonably well protected against accidental collisions.
    // There is no scenario here where intentionally created collisions could do harm.
    digester := crypto.MD5.New()

    encoder := gob.NewEncoder(digester)
    encoder.Encode(objs) // In real life one would handle that error

    return digester.Sum(nil)
}

This works. But there are a few things that give me headaches with this implementation, because I'm not sure that gob will always behave deterministically. For the current version that seems to be the case, but as the referenced answer points out, there might be changes between versions. According to the documentation for gob, default values (0 for ints, empty strings, nil, ...) are omitted when transmitting structs. Also, all int values are transmitted as generic numbers, so uint64 and int will look the same. I can't think of an actual issue with this for my use case, but it smells like a source of trouble.

Now if I were writing that function from scratch, I would probably play it safe, traverse the structure with reflection, and create a hashing tree. But I don't want to do that.
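
To illustrate the amount of case handling involved, here is a minimal sketch of what such a reflection walk might look like (hypothetical and deliberately incomplete: no maps, pointers, interfaces, or cycle detection, and every omitted case is a potential bug):

package main

import (
    "crypto/md5"
    "encoding/binary"
    "fmt"
    "io"
    "reflect"
)

// hashValue recursively writes a type-tagged, length-prefixed
// representation of v into w.
func hashValue(w io.Writer, v reflect.Value) {
    fmt.Fprintf(w, "%s/", v.Type()) // tag every node with its type
    switch v.Kind() {
    case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
        binary.Write(w, binary.BigEndian, v.Int())
    case reflect.String:
        binary.Write(w, binary.BigEndian, int64(v.Len())) // length prefix
        io.WriteString(w, v.String())
    case reflect.Slice, reflect.Array:
        binary.Write(w, binary.BigEndian, int64(v.Len())) // element count
        for i := 0; i < v.Len(); i++ {
            hashValue(w, v.Index(i))
        }
    case reflect.Struct:
        for i := 0; i < v.NumField(); i++ {
            hashValue(w, v.Field(i))
        }
    default:
        // uints, floats, maps, pointers, ... would all need cases here
        fmt.Fprintf(w, "%v", v)
    }
}

func main() {
    h := md5.New()
    hashValue(h, reflect.ValueOf(struct {
        A int
        B []string
    }{1, []string{"12", "3"}}))
    fmt.Printf("%x\n", h.Sum(nil))
}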

I'm pretty sure I'm not the first one out there with these requirements, but I wasn't able to find any well-tested Go code on the web that addresses this issue.

Appendix

Also see: https://crypto.stackexchange.com/questions/10058/how-to-hash-a-list-of-multiple-items

This is not as trivial as it may seem. Simply concatenating data, as pointed out by Adrian, won't work, because then Hash("12", "3") and Hash("123") will generate the same hash. One possible solution is to prepend the length of the data (and the number of elements of lists) before hashing, or to create a tree of hashes, but I can't think of a reliable way to do either of that with complex data structures without writing a lot of reflection code that handles all the different cases (integers, floats, strings, structs, slices, ...) and walks through the entire structure. I want to avoid that, because there are a lot of special cases that can be overlooked doing so, and it seems unnecessarily complicated to me. This is why I tried to solve the problem using serialization, which takes care of most of the problems described above. I'm just not sure whether there aren't some drawbacks to using gob in this scenario, and whether there isn't a smarter solution to the problem.
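
For a flat list of strings, the length-prefix idea is simple enough; here is a minimal sketch (the hard part is extending it to arbitrarily nested structures):

package main

import (
    "crypto/md5"
    "encoding/binary"
    "fmt"
)

// HashStrings prefixes the element count and each element's length,
// so HashStrings("12", "3") and HashStrings("123") cannot collide.
func HashStrings(parts ...string) []byte {
    h := md5.New()
    binary.Write(h, binary.BigEndian, int64(len(parts)))
    for _, p := range parts {
        binary.Write(h, binary.BigEndian, int64(len(p)))
        h.Write([]byte(p))
    }
    return h.Sum(nil)
}

func main() {
    fmt.Printf("%x\n", HashStrings("12", "3"))
    fmt.Printf("%x\n", HashStrings("123"))
}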

Gellweiler

3 Answers


Perhaps to add to Adrian's answer: if you write fmt.Fprint(digester, reflect.TypeOf(ob)) to the digest before each object, Hash("12", "3") and Hash("123") will produce different hashes.

package main

import (
    "crypto"
    _ "crypto/md5"
    "fmt"
    "reflect"
)

func main() {
    fmt.Printf("%x\n", Hash("12", "3"))
    fmt.Printf("%x\n", Hash("123"))
}

func Hash(objs ...interface{}) []byte {
    digester := crypto.MD5.New()
    for _, ob := range objs {
        // Writing the type first acts as a separator between values,
        // so ("12", "3") and ("123") no longer collide.
        fmt.Fprint(digester, reflect.TypeOf(ob))
        fmt.Fprint(digester, ob)
    }
    return digester.Sum(nil)
}

Output:

c3d5dcf1d7540d3e46e7d7b5a8c6e8ae
787ca7e12a2fa991cea5051a64b49d0c

https://play.golang.org/p/nufD3wTJkb

k1m190r

Hash digests implement io.Writer, meaning you can just write data to them as necessary - you don't have to pass everything in a single Write before calling Sum:

for _, ob := range objs {
    fmt.Fprint(digester, ob)
}
return digester.Sum(nil)

Working playground example: https://play.golang.org/p/HtObhrmoaP

If you need to support non-primitive types, you can use something like:

fmt.Fprintf(digester, "%+v", ob)

Or even just:

json.NewEncoder(digester).Encode(&objs)

Like gob, the output of json may change between Go versions, but it seems much less likely as JSON is a very stable format with a very stable implementation.
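
Put together, a JSON-based version of the question's Hash might look like this (a sketch with the error handled; encoding/json sorts map keys, so maps serialize deterministically):

package main

import (
    "crypto/md5"
    "encoding/json"
    "fmt"
)

// Hash feeds the JSON encoding of objs into an MD5 digest.
func Hash(objs ...interface{}) ([]byte, error) {
    digester := md5.New()
    if err := json.NewEncoder(digester).Encode(objs); err != nil {
        return nil, err
    }
    return digester.Sum(nil), nil
}

func main() {
    sum, err := Hash(map[string]int{"b": 2, "a": 1}, "12", "3")
    if err != nil {
        panic(err)
    }
    fmt.Printf("%x\n", sum)
}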

Adrian
  • 42,911
  • 6
  • 107
  • 99
  • This won't work because Hash("12", "3") and Hash("123") will create the same hash (play.golang.org/p/EXcMl_lTru). This is why I was using gob in the first place. I've clarified the question. – Gellweiler Nov 07 '17 at 17:53
  • With JSON there might be whitespace; special characters in strings can be escaped using backslash+hex or backslash+special; non-ASCII characters might get escaped or not; in floats, e can be written in lowercase or uppercase, ... All that doesn't matter when using it as a storage format, but as a source for hashing that's a big problem. – Gellweiler Nov 07 '17 at 18:04
  • It doesn't matter as a source for hashing because Go encodes JSON consistently. – Adrian Nov 07 '17 at 18:14
  • Yeah, but that might change between versions, just like the gob serializer. If I have to choose between one of them, I'm going to go with gob, as it is closer to the actual representation of data in Go and also faster. – Gellweiler Nov 07 '17 at 18:25
  • Which I mentioned in my answer, and noted that it's less likely to change than gob. – Adrian Nov 07 '17 at 18:26
  • Turns out gob doesn't support maps and overall is quite limited. So I'm sticking with JSON for now. I'm still not really happy with this. But I suppose it should be fine. So +1 – Gellweiler Nov 13 '17 at 02:09
  • JSON Encoder even sorts maps by key before serializing them. So seems to be okay. – Gellweiler Nov 14 '17 at 22:33

This is not going to solve the serialization problem, but hashing multiple byte slices is not hard at all.

It's called combining hashes.

One practical way to combine hashes is by XORing them, but you may also choose another function (e.g. a hash function over the hash values). Just make sure any hash value you produce stays uniformly distributed.

Here's a full example:

package main

import (
    "crypto/md5"
    "fmt"
    "hash"
)

func main() {
    fmt.Printf("%x\n", HashArray(md5.New, []byte("12"), []byte("3")))
    fmt.Printf("%x\n", HashArray(md5.New, []byte("123")))
}

func HashArray(newH func() hash.Hash, values ...[]byte) []byte {
    h := newH()
    result := h.Sum(nil) // start from the hash of empty input
    for _, v := range values {
        next := HashValue(newH, v)
        // XORing to combine hashes
        for i := range result {
            result[i] ^= next[i]
        }
    }
    return result
}

func HashValue(newH func() hash.Hash, v []byte) []byte {
    h := newH()
    h.Write(v)
    return h.Sum(nil)
}

Produces:

fadc9070abb527a36b97268885a09f9d
f43135bb2359b55f7fcb0e8dc1db090e

(same first letter is a coincidence)
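
The hash-function-over-hash-values alternative mentioned above could look like this, reusing HashValue from the example (unlike XOR it is order-sensitive, and equal values don't cancel each other out):

// HashArray2 hashes the concatenation of the per-item hashes
// instead of XORing them.
func HashArray2(newH func() hash.Hash, values ...[]byte) []byte {
    h := newH()
    for _, v := range values {
        h.Write(HashValue(newH, v))
    }
    return h.Sum(nil)
}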

astef