
I am writing a hash table implementation that can accept arbitrary data sizes as keys and custom hashing functions (returning uint32_t).

I have one instance of this hash table that uses 16-byte UUIDs as keys. Since UUIDs have a very wide distribution already, I thought that the hashing function could be simply

typedef unsigned char UUID[16];

static inline uint32_t uuid_hash_fn(const UUID uuid)
{ return *((uint32_t*)(uuid + 4)); }  // skipping the 4 MSB that are constant in UUID

Anything wrong with this?

Is the function, as it is written, already optimized in the sense that it just takes enough left-most bytes to fill a uint32_t?

EDIT: following commenters' advice, I'd like to clarify that by "is there anything wrong with this" I meant: is there a possibility of unexpected behavior with this approach?
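
For context, a minimal sketch of the kind of hash-table interface I mean; the type, field names, and exact callback signature below are illustrative only, not my actual code:

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only: names and the callback signature are hypothetical. */
typedef uint32_t (*hash_fn)(const void *key, size_t key_len);

struct hash_table {
    size_t  key_len;    /* arbitrary key size, e.g. 16 for a UUID           */
    size_t  nbuckets;   /* the 32-bit hash is reduced to a bucket index     */
    hash_fn hash;       /* custom hashing function returning uint32_t       */
    /* ... buckets, collision handling, etc. ... */
};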

user3758232
  • Is something wrong with this code or are you simply looking for "best implementation" advice? – Abion47 Sep 24 '20 at 18:27
  • "best and fastest" implementation advice. – user3758232 Sep 24 '20 at 18:32
  • And also whether I'm making any gross miscalculation, since I am not a senior C programmer. – user3758232 Sep 24 '20 at 18:33
  • I thought so. StackOverflow is primarily for people who have specific programming questions or (especially) problems with code. While you might get some people to answer your question, it will likely get closed as opinion-based and you'd probably be better off asking this question on [Code Review](https://codereview.stackexchange.com/). – Abion47 Sep 24 '20 at 18:35
  • @Abion47 Not if it's not implemented yet. – Mast Sep 24 '20 at 18:36
  • @Abion47 I think that kind of cast can result in undefined behavior, certainly in C++ if not C. That would make it not opinion-based. – Mark Ransom Sep 24 '20 at 18:37
  • My code might compile and run but it's hard for me to diagnose whether I am generating good hashes. I guess my programming-specific question is whether this is a reliable hashing function. – user3758232 Sep 24 '20 at 18:38
  • @MarkRansom this is strictly in C. If I cast a specific memory address that I know is filled by me with random bytes, how can that result in undefined behavior? – user3758232 Sep 24 '20 at 18:39
  • 1
    @MarkRansom The code may contain objective errors, but that wasn't the question asked. If the question is reworded to ask if errors or undefined behavior exist in the code rather than asking if this is "optimized" or what the "best implementation" of this hashing function is, then it would be on topic here. – Abion47 Sep 24 '20 at 18:41
  • If you were generating "good" hashes, there would be no need for the UUID to be as long as it is. It might be acceptable for purpose, except you haven't told us what the purpose is with any detail. I can't speak to the undefined behavior, except to note that the standards committees take pains to accommodate weird architectures by putting things off limits even when they might seem straightforward. – Mark Ransom Sep 24 '20 at 18:51
  • Potential UB due to `(uint32_t*)(uuid + 4)` lacking proper alignment for `uint32_t *`. – chux - Reinstate Monica Sep 24 '20 at 19:02
  • @chux I'm not sure I understand your point. Is a 4-byte jump not aligned with uint32_t? Or are you referring to something else? – user3758232 Sep 24 '20 at 20:30
  • 1
    @user3758232 A 4-byte jump is aligned with `uint32_t` if the base address was aligned. Given `uuid` only needs `char` aligmentment, it is not certain `uuid+4` is aligned for `uint32_t`. Overlaying with a `union` and other alignment code may solve this corner case. – chux - Reinstate Monica Sep 24 '20 at 21:07
  • @MarkRansom: UUID is "very long" because they're supposed to be universally unique and because of the "birthday paradox" (essentially; to minimize chance of your UUID colliding with any other UUID from any other software in the world). – Brendan Sep 25 '20 at 02:23
  • @Brendan we don't actually need to know why UUIDs are the length they are. The problem is that we don't know the hash quality of a random set of 4 bytes plucked from the 16. And I would argue that a much smaller UUID could have met the requirements with a sufficient margin, so there must be bits that don't contribute as much to the uniqueness as others. – Mark Ransom Sep 25 '20 at 04:06
  • @MarkRansom: I'd argue that if the bits are good enough to avoid "UUID collisions with every other UUID in the world" then a subset is also good enough as a 32-bit hash for a hash table; and wasting more CPU time (e.g. creating a full hash from the whole 16 bytes of the UUID) to save less CPU time (from a "negligibly different" number of hash collisions) will just make performance worse. – Brendan Sep 25 '20 at 04:37
  • @Brendan You can argue all you want, but in the case of UUIDs it can be wrong: see how a version 1 UUID is defined; this could possibly generate the same hash for all UUIDs generated on one machine over a long time period. – Antti Haapala -- Слава Україні Sep 25 '20 at 05:11
  • @AnttiHaapala: Do you really expect me to believe that "60-bit timestamp of 100-nanosecond intervals" is going to remain the same over a long time period? The only thing that matters (for extracting a 32-bit hash) is to choose the best part of the UUID, which is something I'd expect the original poster already did for whatever their UUIDs are. – Brendan Sep 25 '20 at 07:46
  • It may be worth noting that my hash table implementation, like all textbook and real ones I have seen, does not rely on "strong" hashing. The hash is not to guarantee uniqueness (which is done with several methods like double hashing, open addressing, robin-hood, etc.) but to optimize bucket space usage. Achieving a decent distribution with 4 random bytes is good enough for millions of records. – user3758232 Sep 26 '20 at 03:40
  • Hey, thanks for all the comments. And I don't mind the down votes since I learned something here... – user3758232 Sep 26 '20 at 03:44
  • "Achieving a decent distribution with 4 random bytes is good enough for millions of records." --> Is true if the bytes are random - but they are not as random as one may think. Rather than optimize locally, look to overall performance. IOWs, to optimize well, look to the hash table functions as a set – chux - Reinstate Monica Sep 26 '20 at 20:59

2 Answers


is there a possibility of unexpected behavior with this approach?

Yes. The code has undefined behavior: `(uint32_t*)(uuid + 4)` relies on `uuid` being aligned for `uint32_t` access, but `uuid` only requires `char` alignment.

UUIDs have a very wide distribution already

Hmm, perhaps not. I'd go for a hash of the entire object and trust my compiler to emit efficient code for memcpy().

// Example that uses the entire UUID (needs <stdint.h> and <string.h>).
static inline uint32_t uuid_hash_fn(const UUID uuid) {
  uint32_t hash[4];
  memcpy(hash, uuid, sizeof hash);  // uuid is already a pointer; copying avoids the alignment problem
  return hash[0] ^ hash[1] ^ hash[2] ^ hash[3];
}

A better hash might XOR the two 64-bit halves of the UUID and reduce the result modulo the largest 32-bit prime, 4294967291:

static inline uint32_t uuid_hash_fn(const UUID uuid) {
  uint64_t hash[2];
  memcpy(hash, uuid, sizeof hash);  // copy all 16 bytes into two 64-bit halves
  return (hash[0] ^ hash[1]) % 4294967291u;
}
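
For illustration, a minimal self-contained sketch of how such a hash might feed the table; the sample UUID bytes and the bucket count are assumptions for the example, not part of the answer:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef unsigned char UUID[16];

static inline uint32_t uuid_hash_fn(const UUID uuid) {
  uint64_t hash[2];
  memcpy(hash, uuid, sizeof hash);            // byte-wise copy, no alignment requirement
  return (hash[0] ^ hash[1]) % 4294967291u;   // largest 32-bit prime
}

int main(void) {
  UUID id = {0x8f, 0x3a, 0x77, 0x01, 0x42, 0xbe, 0x4c, 0x5d,
             0x9a, 0x10, 0x55, 0x6e, 0x21, 0xcc, 0x03, 0xf4};  // example bytes only
  size_t nbuckets = 1024;                      // hypothetical table size
  size_t bucket = uuid_hash_fn(id) % nbuckets;
  printf("bucket = %zu\n", bucket);
  return 0;
}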

The comment `// skipping the 4 MSB that are constant in UUID` is unclear: "MSB" usually means most significant bits, yet the code skips the first 4 bytes and uses only bytes 4–7 of the 16, ignoring the remaining 8 entirely.

chux - Reinstate Monica

The basic idea of what you're doing (recycling existing random bits as the hash) is fine, as long as you don't need cryptographically secure hashes (which I doubt you do).

The other answer is right about the alignment problem. To fix that either:

a) use typedef uint32_t UUID[4]; instead. This is likely to be better/faster for all code and would mean that you can return uuid[1]; with no casts. It's also likely to take you longer to make this change (depending on where the original UUID came from and what you do with it elsewhere).

b) Use something like return (uint32_t)uuid[4] | ((uint32_t)uuid[5] << 8) | ((uint32_t)uuid[6] << 16) | ((uint32_t)uuid[7] << 24); so it doesn't need alignment. This is a simple change, but may give slightly worse performance. A sketch of both options follows.
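
To make the two options concrete, here is a minimal sketch of each; it is illustrative only (the UUID_A/UUID_B typedef names are invented so both variants fit in one snippet, and the byte offsets follow the question's original uuid + 4):

#include <stdint.h>

/* Option (a): store the UUID as four uint32_t words, so uuid[1] is already
   the aligned 32-bit value that the original cast tried to read. */
typedef uint32_t UUID_A[4];

static inline uint32_t uuid_hash_fn_a(const UUID_A uuid) {
  return uuid[1];   /* bytes 4..7, no cast, no alignment issue */
}

/* Option (b): keep the byte array and assemble the value byte by byte,
   which is valid for any alignment (little-endian order shown, matching
   what the original cast would read on a little-endian machine). */
typedef unsigned char UUID_B[16];

static inline uint32_t uuid_hash_fn_b(const UUID_B uuid) {
  return (uint32_t)uuid[4]
       | ((uint32_t)uuid[5] << 8)
       | ((uint32_t)uuid[6] << 16)
       | ((uint32_t)uuid[7] << 24);
}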

Brendan
  • I like this answer best because it addresses the alignment issue and also does not assume a crypto hashing which would be "bad" (=unnecessarily slow) for a hash table. – user3758232 Sep 26 '20 at 03:42