1

I have a structure having 3 integers in [1, 1000] and string.

I need to represent it in 32 bit number such that two structures differing in at least one field would produce different codes, while structures having the same content consistently produce the same code. Usually one of the integer fields would be increased in several units. This should necessarily produce a different code.

At first, I thought to format the structure fields to a string in a constant format and then to hash it using GetHashCode function of String class. But then I read here in some discussions that repeating process runs on the same input don't necessary produce the same hash output. First of all, is this true in .NET 4? It's important to me, because the hash values should be persisted and stay consistent over process runs. I also saw here suggestions to perform bitwise operations of results of the platform GetHashCode applied on each structure field using prime numbers. But here again, apparently I can't count on the consistent result of the process runs.

If I use cryptographic hash functions, I exceed 32 bit.

If I didn't have a string field, I would compose the code as a 32 bit array from the numeric fields. May be it's worth to XOR such bit array with the string field GetHashCode result? Do I increase chances that repeating run on some input will produce the same hash output?

What would you suggest to do?

John Saunders
  • 160,644
  • 26
  • 247
  • 397
  • 2
    You need a perfect hash if you would like to avoid collisions. Can you uniquely describe each of your structures in just 32 bits? Also, you have no guarantees as to the stability of the built-in hash functions (and you should not either). So what are you using this "hash" for? – user7116 Mar 04 '13 at 22:25
  • possible duplicate of [Create a hashcode of two numbers](http://stackoverflow.com/questions/892618/create-a-hashcode-of-two-numbers) – stakx - no longer contributing Mar 04 '13 at 22:26
  • @sixlettervariables A perfect hash would require knowing the entire value space up front, right? (And for it to contain less than 2**32 elements.) – millimoose Mar 04 '13 at 22:48
  • @user2132086: Are all four fields part of the "identification" of the struct, or are one or more of them simply an "attribute" (to borrow DB terminology)? – Pieter Geerkens Mar 05 '13 at 00:40
  • @PieterGeerkens Yes, all the fields together combine the key – user2132086 Mar 05 '13 at 10:39
  • @user2132086: so why does the key *need* to be 32bits? Besides, you probably do not want a *hash code* per se, as you'll run into some implementation problems if you attempt to persist these. – user7116 Mar 05 '13 at 16:49

3 Answers3

1

If you had the following:

struct 
{
    int A;
    int B;
    int C;
}

Assuming A, B, C are in the range [1, 1000]. It is possible to create a "perfect hash" (without collision) since A, B, C can have each 1000 different possible values. Indeed, log2(1000^3) <= 32 (1000^3 is the number of possible values of the sturcture and the log2 is used to get the number of bits required to store all these values without collision and 32 is the number of bits of an integer).

int MyHashCode()
{
    return 1000 * (1000 * (A - 1) + (B - 1)) + (C - 1);  // There is no overflow or collision since A, B, C are in the range [1, 1000]
}

We can simplify it by using a weaker condition: A, B, C are in the range [0, 1000]:

int MyHashCode()
{
    return 1001 * (1001 * A + B) + C;  // There is no overflow or collision since A, B, C are in the range [0, 1000]
}

Update

Given that your structure contains a string inside it. What you want to achieve is impossible. Because the string can represent an infinite number of values.

If it was possible to do it one could create a very powerful compression algorithm. That could store any file into a... 32 bit number! Mathematically, it comes from the fact that an injective function can only map to a bigger space.

Cédric Bignon
  • 12,892
  • 3
  • 39
  • 51
  • Seeing as there are more than 2**32 possible files, such a compression algorithm provably cannot exist ;) (Okay, technically, a *compression* algorithm can, as long as you don't ever need the *decompression* algorithm.) – millimoose Mar 04 '13 at 22:45
  • @millimoose You are right! Let's create the most powerful compression algorithm : _x => 0_. – Cédric Bignon Mar 04 '13 at 22:52
  • @CédricBignon My string field represents a machine name, so some assumptions, such as length limit, can be done, but I don't see how this helps :-) – user2132086 Mar 05 '13 at 08:12
  • @user2132086 Except if the string can have a maximum of 4 different values, you can't do a perfect hash for this on 32 bit. (Why 4? Because 2^32/1000^3 = 4.29...) – Cédric Bignon Mar 05 '13 at 13:32
1

Anonymous types have an autogenerated sensible GetHashCode() implementation. I'd try just using:

struct MyStruct 
{
    int _intField1;
    int _intField2;
    int _intField3;
    string _stringField;

    public long GetHashCode() 
    {
        return new { _intField1, _intField2, _intField3, _stringField }.GetHashCode();
    }
}

Since both ints and strings are immutable types, the hash code should remain the same between runs of the application as long the underlying .NET framework version is the same. (This might or might not be "persistent enough".)

That said, it might change should the internal implementation of GetHashCode() change. In that case, use a cryptographic hash. It doesn't matter that it exceeds 32 bits, because cryptographic hashes are designed to produce wildly different output for small changes in input. This means that for two different inputs, any given 32 bits of the hash code are very unlikely to be equal. Just use BitConverter.ToInt32() to convert whichever part of the hash you want to an int.

Also, obviously, this will just make it somewhat unlikely that two different structures will produce different hash codes. (This can be determined using an approximate formula for the birthday paradox, if I'm reading wiki correctly, it means you'd have a 10% chance of getting duplicates once you store ~140,000 ~30,000 records. Assuming the cryptographic hash has ideal properties. I'm not sure you can do better without a perfect hash.)

millimoose
  • 39,073
  • 9
  • 82
  • 134
  • 1
    The [default implementation of reference type GetHashCode is not stable between order of execution between executions and per thread](http://stackoverflow.com/questions/8178115/why-does-system-type-gethashcode-return-the-same-value-for-all-instances-and-typ/8178226#8178226). – user7116 Mar 05 '13 at 03:07
  • @millimoose If I use MD5Cng (I think it should be faster), can I assume that any 32 bits of them (I don't know yet how big is the usual result of MD5Cng) are unique? – user2132086 Mar 05 '13 at 08:50
  • @user2132086 The newer hashes (like SHA256) should fare better when it comes to preventing collisions, but generally, yes. Wiki tells me this property is called the [*avalanche effect*](http://en.wikipedia.org/wiki/Avalanche_effect), and it's a thing that cryptography researchers deliberately try to achieve. The basic idea is that the hash code for a given message should seem as if it may as well have been determined randomly. Now obviously you can get collisions if you make random picks from a limited value space, but you can estimate the probability those for a given number of picks. – millimoose Mar 05 '13 at 16:19
  • @sixlettervariables I don't have much to prove this yet, but I strongly doubt *anonymous types, specifically* use the default implementation of `GetHashCode()`. They certainly autogenerate a "sane" implementation of `Equals()`, which implies they should also autogenerate an analogous implementation of `GetHashCode()` that mostly works with hashcodes of the fields (which are all *value* types in this case). I'd bet dollars to donuts the values are stable at least during the lifetime of a .NET framework version. – millimoose Mar 05 '13 at 16:27
  • @sixlettervariables And looking at the decompiled source for the anonymous type, all that is used in its `GetHashCode()` implementation is a bunch of compile-time constants, and `EqualityComparer.Default.GetHashCode()` for the fields which mostly seems to delegate to the `GetHashCode()` of the field values. – millimoose Mar 05 '13 at 16:36
  • @millimoose: you are free to assume hashes won't change, however, as proven they can and do. Best to not rely on the default implementation beyond providing a stable hash value *for the duration of your execution*. – user7116 Mar 05 '13 at 16:47
  • @sixlettervariables I'm not saying that's not a reasonable safe assumption, given the fact that the documentation doesn't promise anything else. What I'm saying is that looking at the implementations of `GetHashCode()` for an anonymous type with value-type(-ish) fields, I can conclude that for the hashes to become unstable would require a change to the C# compiler and a recompilation of your code, or a change in the .NET framework behaviour. Unless I read the code wrong, they *will* remain stable across threads, and executions that involve the same assemblies. – millimoose Mar 05 '13 at 17:20
  • @millimoose: It only becomes unstable if they change *during execution*. They are more than welcome to change the algorithm between executions...and did between multiple versions of the framework. Using a hash value as a stable reference between any two executions is thus wrong. – user7116 Mar 05 '13 at 17:25
0
  1. Serialize your type to a byte[]
  2. Apply common hash algorithm to byte[] to obtain hash byte[]
  3. Pull out, for example, the first 32 bits of the hash byte[] and use that
Timothy Shields
  • 75,459
  • 18
  • 120
  • 173
  • If I use MD5Cng (I think it should be faster), can I assume that the first 32 bits of them (I don't know yet how big is the usual result of MD5Cng) are unique? – user2132086 Mar 05 '13 at 08:50
  • You can 'sort of' assume they're unique. If you're mapping objects larger than 32 bits to 32 bit hash codes, by the so-called pigeonhole principle there will always be some two objects that have the same hash code. The probability of encountering them is very low, though. – Timothy Shields Mar 05 '13 at 15:20
  • @TimothyShields The probability of at least one collision increases rapidly with the amount of objects you're hashing. – millimoose Mar 06 '13 at 18:25
  • @millimoose I'm very aware of that. The original question being asked was essentially "how do I deterministically hash my custom type to a 32-bit hashcode?" If the question is instead "can I use a 32-bit hash code to uniquely identify my objects" then the answer is definitely "no!" (See http://preshing.com/wp-content/uploads/2011/05/small-probabilities.png) – Timothy Shields Mar 06 '13 at 21:01
  • @TimothyShields Hm, that chart means my understanding of the reverse birthday paradox example on Wiki was hilariously wrong, and 32-bit hashes are just completely useless at being "unique enough" – millimoose Mar 06 '13 at 21:14
  • @millimoose Yes, that's why I had assumed that the original question wasn't actually asking for true uniqueness. For the purposes of using a `HashSet` it would be fine, but not for a record identifier. – Timothy Shields Mar 06 '13 at 21:36