0

This is a slight variation on how to combine two hashes in that I would like the resulting hash to be influenced more by one of the input.

For the roughly symmetric case, we have algorithms such as boost::hash_combine:

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
  std::hash<T> hasher;
  seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

I am looking for a weighted version, perhaps the interface would resemble:

uint64_t weighted_hash_combine(uint64_t hashA, uint16 weightA, uint64_t hashB, uint16 weightB);

The premise being that the probability of a bit in the output hash being affected by changes in one of the input hashes is a function of the ratio of weightA to weightB.

This would allow me to improve on a tree hashing algorithm for unbalanced trees. A simpler way to hash a tree is covered here, essentially a breadth first traversal pushes each hash(node) into an accumulated value. The problem with this is that the last node to be mixed into the combined hash will have a greater influence over the result than the first.

If a reasonable weighted hash combination is available, then I can bias the combination based on the number of nodes that contributed to each hash and, hopefully, improve the fairness of the hash function.

So far I've come up with:

uint64_t weighted_hash_combine(uint64_t hashA, uint16 weightA, uint64_t hashB, uint16 weightB)
{
  if (weightA > weightB)
  {
    return weighted_hash_combine(hashB,weightB,hashA,weightA);
  }
  uint64_t ratio = weightA / weightB;
  uint64_t combined = hashA;
  for (uint64_t i = 0; i < ratio; i++)
  {
     hash_combine(combined, hashB);
  }
  return combined;
 }       

This is rather lacking in numerical sophistication though, so I'm hoping the community can recall / invent a better solution.

The high level goal is to short-circuit an equality test between trees when the (size or) hash values are different, given that they will often only differ in one or two leaves and there's no good way to estimate which.

Community
  • 1
  • 1
Jon Chesterfield
  • 2,251
  • 1
  • 20
  • 30

1 Answers1

0

Hashes don't work like that. When you properly combine hashes, a change in either hash is guaranteed to change the combined hash, and in fact, by changing either hash, you can completely determine the value of the combined hash.

The most commonly used mixes are variations on :

h = h1*P2 + h2*P1

Where P1 and P2 are different odd primes (or 1). This would be performed mod 2^32 or mod 2^64 depending on the word size, but in either case, you could make h any value you want by choosing either h1 or h2, and this will not go away no matter how many other hashes we mix in like this.

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87
  • Are you defining a "proper" combination to be a symmetric one? For example, one approach to combining would be to take the first three bits from one 64 bit hash and last 61 bits from the other. That would certainly make the combination more sensitive to one hash than the other. – Jon Chesterfield Jul 04 '16 at 12:59
  • I don't think a change in one input hash is guaranteed to change the combination either. For X = combine(A,B) with X, A, B elements of uint64_t and given A, there will be a set of values of B that lead to the given value of X, because there is a loss of information in converting 128bits back down to 64. – Jon Chesterfield Jul 04 '16 at 13:04
  • "proper" means that it has the qualities I mentioned. When you say "symmetric", you are still thinking that something is lost from either hash when you combine. It doesn't work like that. The combined result is *just as good* as a hash of either input. – Matt Timmermans Jul 04 '16 at 13:06
  • Think of something simpler: h = h1+h2. See how a change in either h1 or h2 is guaranteed to change h? See how you can make any value of h you want, by solving h1 = h2 - h? – Matt Timmermans Jul 04 '16 at 13:08
  • I think we're close to consensus. uint32_t combine(uint32_t,uint32_t) does lose information. We started with 64 bits, we ended with 32. When we solve the equation like that, we get a set of values back, not a single value. – Jon Chesterfield Jul 04 '16 at 13:13
  • Also note that there are NOT multiple h1s that produce the same h, because h1 = h2 - h has only one solution. I'm gonna stop talking now, because you'll get it in a few minutes. And yes it does lose information -- you can't determine BOTH h1 and h2 from h. You can determine EITHER – Matt Timmermans Jul 04 '16 at 13:15