0

In order to identify an object in our database and make lookup easier, I have decided to hash the values that identify the object and store them in our database.

These values could be strings, guids, decimals, booleans or datetimes. So for example, we could have

object 1: "test 1", 1, Guid.NewGuid() -> combined hash of these values = 1
object 2: "test 2", 2, DateTime.Now -> combined hash of these values = 2

In this case object 1 and object 2 are different because their hashes are different.

In my first attempt, I used the HashCode of the values and combined them in an order independent way (see here). However I quickly came across a number of issues with this:

  1. I was getting collisions with sequential guids
  2. Hashes may not be unique across app domains

I was thinking of using SHA256 to generate the hashcodes, but how can I combine them in an order-independent way?

Essentially my requirements for my hashing function H are

  1. Commutative: H(1, "test") = H("test", 1)
  2. Injective: So the hash of a tuple should not equal the hash of another tuple unless the tuples themselves are equal (i.e. the tuple components are equal)
  3. Cryptographically secure hashes are not needed
  4. Making the process reversible is not needed

What could I use for this?

Umair
  • 3,063
  • 1
  • 29
  • 50
  • 1
    Your second requirement is basically infeasible, as strings can be arbitrarily long, unless you're willing to have arbirarily long hashes, too. If you want it to become "there's a really, really small chance of hash collisions" that's a different matter. – Jon Skeet Oct 11 '17 at 16:13
  • Consider XORing the individual hashes together, for example. – Jon Skeet Oct 11 '17 at 16:15
  • Couldnt you just simply concat a string representation of each column? like "test 1"|"1"|"xxxx" ? (Instead of | you can use something else) – Rand Random Oct 11 '17 at 16:16
  • 1
    You're going the wrong way; hashes are by definition allowed to have collisions. You need to create a class and define an equality method that ignores the order of values in a sequence. – Dour High Arch Oct 11 '17 at 16:18
  • @JonSkeet We have minimal length on the strings. Enforced by our database. I cannot use HashCodes because I am getting `H("abc") != H("abc")` if run on different app domains (for my test) – Umair Oct 11 '17 at 16:19
  • @RandRandom `H(x,y) = x.ToString() + "|" + y.ToString()` is not commutative! – Umair Oct 11 '17 at 16:20
  • Sure, you shouldn't use GetHashCode for anything persistent - but you could take a SHA256 hash of each element and XOR those. But you still need to handle potential collisions. (In particular, H("a", "a") and H("b", "b") would both be 0, for example. There may be ways around that, but you should consider whether it's a likely scenario or not first.) – Jon Skeet Oct 11 '17 at 16:23
  • You could sort the string before concating them, so if you have "test", "1" and "1", "test" and sort it alphabetically - I believe a sort will always return the same result - you should get "1", "test" for both cases and then concat them. – Rand Random Oct 11 '17 at 16:27
  • @JonSkeet Ah I see what you mean sorry. I will investigate! Thanks. – Umair Oct 11 '17 at 16:35
  • @Umair. What do you mean by "make lookup easier"? Actually it what indexes in database should do. Could you please clarify what kind of lookups do you have? – Valerii Oct 11 '17 at 17:39
  • @JonSkeet In order to prevent the `H("a", "a") = H("b", "b") = 0` scenario, is there something I could use to smooth out the hash? This will be an extreme edge case for us but would like to prevent it from happening. – Umair Oct 12 '17 at 17:03
  • @Umair: Well you could just use some sort of addition operation instead (treat the hashes as 256-bit integers, add them together, truncating to 256 bits afterwards). – Jon Skeet Oct 12 '17 at 17:36

0 Answers0