1

Using the OpenJDK's hashCode, I tried to implement a generic hashing routine in C:

U32 hashObject(void *object_generic, U32 object_length) {
    if (object_generic == NULL) return 0;

    U8 *object = (U8*)object_generic;
    U32 hash = 1;

    for (U32 i = 0; i < object_length; ++i) {
//      hash = 31 * hash + object[i]; // Original prime used in OpenJDK
        hash = 92821 * hash + object[i]; // Better constant found here: https://stackoverflow.com/questions/1835976/what-is-a-sensible-prime-for-hashcode-calculation
    }

    return hash;
}

The idea is that I can pass a pointer to any C object (primitive type, struct, array, etc.) and the object will be uniquely hashed. However, since this is the first time I am doing something like this, I'd like to ask- Is this the right approach? Are there any pitfalls that I need to be aware of?

strNOcat
  • 913
  • 2
  • 8
  • 22
  • There are no generic objects in C. And we are not a code verification site. – too honest for this site Mar 10 '17 at 02:39
  • @Olaf Generic in the sense that I can pass their pointer (implicitly casted as void*) to this **one** function instead of writing a hashing function for every type (primitive and user-defined) that I use. – strNOcat Mar 10 '17 at 02:41
  • 2
    @Olaf: This is a question. Questions are in fact what we do here. – Ry- Mar 10 '17 at 02:41
  • @Olaf I am not asking for verification of code. I am asking whether this approach of treating any random byte as a number for hashing purpose is the right way to go (algorithm). – strNOcat Mar 10 '17 at 02:43
  • 2
    What is your notion of equality? For a hash to be "correct" it has to satisfy a==b implies hash(a) == hash(b). Floating point values can be equal (for example +0 and -0) without being bitwise the same. – Paul Hankin Mar 10 '17 at 02:57
  • @PaulHankin For the moment, I was simply dealing with integers and strings for which this function seems to work. The idea was to extend it to all other types, but I hadn't considered issues like padding or +0.f/-0.f bitfield representatios. – strNOcat Mar 10 '17 at 03:04
  • @Ryan: It lacks a **specific** problem statement, but asks for a code review/critique. I did not say it is not a question. – too honest for this site Mar 10 '17 at 03:10

3 Answers3

3

There are decidedly pitfalls. The below program using your function, for example, prints a different value for each equivalent object (and a different value every time it’s compiled) under gcc -O0:

#include <stddef.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

struct foo {
    char c;
    int i;
};

static uint32_t hashObject(void const* object_generic, uint32_t object_length) {
    if (object_generic == NULL) return 0;

    uint8_t const* object = (uint8_t const*)object_generic;
    uint32_t hash = 1;

    for (uint32_t i = 0; i < object_length; ++i) {
        hash = 92821 * hash + object[i];
    }

    return hash;
}

int main() {
    struct foo a[2];

    a[0].c = 'A';
    a[0].i = 1;

    a[1].c = 'A';
    a[1].i = 1;

    _Static_assert(
        sizeof(struct foo) == offsetof(struct foo, i) + sizeof(int),
        "struct has no end padding"
    );

    printf("%d\n", hashObject(&a[0], sizeof *a));
    printf("%d\n", hashObject(&a[1], sizeof *a));

    return EXIT_SUCCESS;
}

This happens because padding can contain anything.

Ry-
  • 218,210
  • 55
  • 464
  • 476
  • Padding shouldn't be a problem AFAIK as I am using a zeroed memory pool (mmap'ed) for allocating objects. Is random stuff in padding bytes the only reason it might fail? – strNOcat Mar 10 '17 at 02:46
  • @NamanDixit: Yes, I’m pretty sure it’s the only reason, but I would certainly not count on being able to avoid opening places for a standard-conformant implementation to change padding bytes in structs in a zeroed memory pool. – Ry- Mar 10 '17 at 02:56
  • Pragmatically speaking, does it actually happens on Linux/Windows/MacOS under MSVC/GCC/Clang? I did some quick testing and it seems that zeroed bytes stay zero. (But yes, it is a issue I hadn't though about). – strNOcat Mar 10 '17 at 03:08
  • @NamanDixit: Er, scratch what I said about that being the only reason. Paul Hankin is right about negative zeroes also being a problem, both with floats and probably with one’s complement integers. – Ry- Mar 10 '17 at 03:11
  • @NamanDixit: See also https://stackoverflow.com/questions/10954393/when-are-pad-bytes-copied-struct-assignment-pass-by-value-other#comment14315872_10954584 – hard to know if it happens in practice under some circumstances without looking inside the compiler, which is something I can’t do in the case of MSVC. – Ry- Mar 10 '17 at 05:03
1

In the comments you ask what would happen if you zero out the struct object before using them.

It would not help. The hashes could still be different because padding bytes take unspecified values when a value is stored into a struct object or a member of a struct object1. The unspecified values may change on every store.

There is an additional problem, with other types. Any scalar type (pointer, integers, and floating types) may have different representations of the same value. This is a similar problem as struct types have with padding bytes, mentioned above. The bit representations of scalar objects may change, even though the value did not, and the resulting hash will be different.


(Quoted from: ISO/IEC 9899:201x 6.2.6 Representation of types 6.2.6.1 General 6)
When a value is stored in an object of structure or union type, including in a member object, the bytes of the object representation that correspond to any padding bytes take unspecified values.

Community
  • 1
  • 1
2501
  • 25,460
  • 4
  • 47
  • 87
0

No.

std::vector<int> v1 = {1, 2, 3, 4};
std::vector<int> v2 = {1, 2, 3, 4};

std::cout << "hash1=" << hashobject(&v1, sizeof(v1)) 
    << "hash2=" << hashobject(&v1, sizeof(v1)) << std::endl;

would report two different hash values, which is probably not the intended behaviour.

PS: the question is about C rather than the C++, but the similar class can be in C.

user31264
  • 6,557
  • 3
  • 26
  • 40
  • 2
    The question is about C, but it's a good point nonetheless: if the object contains a pointer, the hash might be different even if the data pointed to is equivalent. – Cris Luengo Mar 10 '17 at 02:56
  • Please observe the language tags on the question. This post should only have C answers. – 2501 Mar 10 '17 at 07:59