3

When learning data structures, specifically hashtables, we were told that inventing an effective hash function to a datatype is a very hard task, but it was suggested that a quick shortcut exists. Namely, if we can assume that objects do not move around in memory, and we can define object equality as having the same memory address (use reference equality as opposed to value equality), then we can obtain the hashcode of an object like this:

#include<iostream>
template<typename T>
class hashcoder {
private:
    union impl {
        T* ptr;
        int32_t hashcode; // or int64_t on a 64-bit architecture = architecture with 64-bit pointers
    };
public:
    static int32_t hash(const T& r) {
        impl i;
        i.ptr = &r;
        return i.hashcode;
    }
};

class myclass {
  // whatever
};

int main() {
    myclass m;
    std::cout << hashcoder<myclass>::hash(m) << std::endl;
}

So my question is:

  • Is there anything wrong with using memory address for hashcode (again assuming reference equality is the desired behaviour)?
  • Given that using unions for conversion is undefined behaviour, how can we convert a memory address to an integer?
  • (Feel free to point out any other errors I've made in the above code. C++ pointers are dangerously easy to get wrong.)
Community
  • 1
  • 1
marczellm
  • 1,224
  • 2
  • 18
  • 42

4 Answers4

4
  • No, nothing wrong with that; the hash is an integer and is guaranteed to be unique for each object, which reduces the probability of a collision.
  • Cast the pointer to uintptr_t. No need for a union. Also, uintptr_t has the right size for the platform so you no longer need to muck with int32_t etc.

    uintptr_t hash(const T &r)
    {
        return uintptr_t(&r);
    }
    

(If the hash must be 32 bits, either cast this to uint32_t or, on a 64-bit platform, combine the two halves using the appropriate magic.)

Fred Foo
  • 355,277
  • 75
  • 744
  • 836
2

First, your code assumes that distinct instances of T are always different (for != operator). This is not always the case, and is certainly false for e.g. std::string when you want to hash on the string value (i.e. content), not on some address.... Likewise, if you want to hash (mathematical) vectors of integers, you should hash on their content (the mathematical components of the vector).

Then you should be aware that just using the address of something as its hash is probably suboptimal, since addresses of large enough objects tend to be a multiple of 8 or 16 bytes (precisely, of the alignment of that object type), and often addresses of objects allocated near the same moment are quite similar. In fact some "middle bits" of the pointer are probably more "random" than lower bits or very high bits.

A slightly better approach might be to do some bitwise arithmetic on the pointer address, e.g.

static inline int32_t ptrhash(const void*p) {
   uintptr_t u = (uintptr_t)p;
   return u ^ (u >> 10);
}

or

static inline int32_t ptrhash(const void*p) {
   uintptr_t u = (uintptr_t)p;
   return (u * 200771) + (u % 300823);
}

Both 200771 and 300823 are prime numbers. you could replace the + with a bitwise xor ^

or whatever trick is mixing appropriately the bits of the address.

Of course, YMMV and it is absolutely system specific (e.g. depends if your system has ASLR)

Another approach, for some class T, might be to generate -in the constructor, e.g. at construction time of instances of T, some random hash (using a quick PRNG like lrand48 ...), and put that hash as private instance member variable. Or simply use some static counter, to number uniquely each instance, and use that number as an hash.

The important point is to be sure that all your instances are different! If you want to hash on content, this is not the case (but see memoization, etc...).

Also, I would disagree with your teacher: indeed, inventing a very good hash function is hard, but making a "good enough" -on average- hashing function is generally easy (e.g. use a linear combination with primes coefficients of the hash of constituent parts, see Bezout's theorem). And usually, in practice, such "easy" hash functions work well enough (of course there are exceptions, and the improbable worst case could be aweful).

Read also about perfect hash, e.g. using GNU gperf.

Basile Starynkevitch
  • 223,805
  • 18
  • 296
  • 547
  • +1 (much earlier) for lots of good information. Strangely, with GCC your hash algos seem to work far better than simple using the pointer for ~10,000 pointers, but then far worse around 1 million pointers. Goodness knows why... maybe my test's screwy somehow - code at ideone.com/hcriMt – Tony Delroy Aug 29 '14 at 03:26
0

If you want reference equality, there's nothing wrong with using the address as the hashcode. However, a much saner way to implement it than the union is to using intptr_t. In case intptr_t is larger than int32_t, you can AND with -1 and then static_cast to uint32_t.

David Schwartz
  • 179,497
  • 17
  • 214
  • 278
0

You just need a function template. Here is what you could do (assuming your hashtable has a limited number of buckets) :

template <typename T>
uintptr_t hashcode(const T &obj, size_t size) {
    return reinterpret_cast<uintptr_t>(&obj) % size;
}
Rerito
  • 5,886
  • 21
  • 47