1

I have this class:

public class SomeClass
{
    public string Str1 { get; set; }
    public string Str2 { get; set; }
    public string Str3 { get; set; }
    public string Str4 { get; set; }
}

and I would like to create a hashkey, which is persisted in a database as varbinary(20), to determine uniqueness of the class (case insensitive). I guess the usual GetHashCode method could not be used in this case. What would be best practice in this scenario?

cs0815
  • 16,751
  • 45
  • 136
  • 299
  • *I guess the usual GetHashCode method could not be used in this case.* You guessed it right :-) +1 just for this! You shouldn't EVER persist a GetHashCode – xanatos Mar 17 '16 at 13:16
  • What a useful comment (-: – cs0815 Mar 17 '16 at 13:17
  • See http://stackoverflow.com/q/10452228/613130 – xanatos Mar 17 '16 at 13:17
  • What you said about `GetHashCode` is something very important and quite advanced. I'm sure that 9 out of 10 C# programmers don't know it. – xanatos Mar 17 '16 at 13:18
  • The solution of usr at http://stackoverflow.com/a/10452967/613130 in particular seems to be very beautiful. – xanatos Mar 17 '16 at 13:19
  • Why the restriction of varbinary(20)? - Why not use one of the many existing methods? – Allan S. Hansen Mar 17 '16 at 13:21
  • @AllanS.Hansen such as - I can change the n in varbinary(n). – cs0815 Mar 17 '16 at 13:22
  • I guess by Uniqueness you mean unique Str1, Str2, Str3 and Str4 combination. In that case GetHashCode alone won't be enough to determine it since 2 instances might have the same hash code and different values for these properties. – vc 74 Mar 17 '16 at 13:22
  • yes @vc74 that's correct. – cs0815 Mar 17 '16 at 13:24
  • Couldn't one of the methods in System.Security.Cryptography do the trick? If needing to be case insensitive and it's just string properties - I'd just dump everything into upper or lowercase when using them. – Allan S. Hansen Mar 17 '16 at 13:26

2 Answers2

3

Simple example:

public class SomeClass
{
    public string Str1 { get; set; }
    public string Str2 { get; set; }
    public string Str3 { get; set; }
    public string Str4 { get; set; }

    public byte[] SHA256()
    {
        using (var sha256 = new SHA256Managed())
        {
            var strings = new[] { Str1, Str2, Str3, Str4 };

            for (int i = 0; i < strings.Length; i++)
            {
                string str = strings[i];

                if (str != null)
                {
                    // Commented lines are for using ToUpperInvariant()
                    //str = str.ToUpperInvariant()
                    byte[] length2 = BitConverter.GetBytes(str.Length);
                    sha256.TransformBlock(length2, 0, length2.Length, length2, 0);

                    // byte[] sortKeyBytes = Encoding.UTF8.GetBytes(str);
                    byte[] sortKeyBytes = CultureInfo.InvariantCulture.CompareInfo.GetSortKey(str, CompareOptions.IgnoreCase).KeyData;

                    sha256.TransformBlock(sortKeyBytes, 0, sortKeyBytes.Length, sortKeyBytes, 0);
                } 
                else
                {
                    byte[] length2 = BitConverter.GetBytes(-1);
                    sha256.TransformBlock(length2, 0, length2.Length, length2, 0);
                }
            }

            sha256.TransformFinalBlock(new byte[0], 0, 0);

            byte[] hash = sha256.Hash;
            return hash;
        }
    }
}

I'm using SHA256 and the solution is based on the solution suggested by @usr in https://stackoverflow.com/a/10452967/613130 . The generated hash code is 32 bytes long, but you can truncate it to 20 (clearly you'll reduce its uniqueness).

I prepend the length of the various strings to the strings. In this way { "ABCD", "", "", "" } will produce a different hash than { "A", "B", "C", "D" }.

If you prefer you can use good old ToUpperInvariant() and hash based on it (there are some commented lines in the code... You uncomment them, remove the byte[] sortKeyBytes = CultureInfo.InvariantCulture and live happy :-) ).

I have to tell the truth, I'm not sure of the "stability" of GetSortKey... Will GetSortKey return the same weights in 5 years, in .NET 10.0 with Unicode 11.0? Who knows? I surely don't!

MSDN suggests that they could change:

If an application serializes a SortKey object, the application must regenerate all the sort keys when there is a new version of the .NET Framework.

So in the end I suggest the alternative solution based on .ToUpperInvariant() (to be clear, if my boss asked me to do it, I would tell him: use .ToUpperInvariant()). Note that even with .ToUpperInvariant() there could be small changes in the future. New upper case characters could be introduced for existing lower case characters. See http://unicode.org/faq/casemap_charprop.html "Can a case pair be added if one of the pair is already encoded?"

Community
  • 1
  • 1
xanatos
  • 109,618
  • 12
  • 197
  • 280
  • stupid question. this returns byte[32]. Would this mean I have to use varbinary(32) during persistence? – cs0815 Mar 17 '16 at 13:38
  • @csetzkorn Yep, but as I've written, *you can truncate it to 20* `Array.Resize(ref hash, 20)` – xanatos Mar 17 '16 at 13:45
  • yeah sorry overlooked this bit – cs0815 Mar 17 '16 at 13:47
  • ... Isn't a hash unique only when you use it in its full length? – Thorsten Dittmar Mar 17 '16 at 13:47
  • @ThorstenDittmar Clearly by truncating the hash you are reducing its uniqueness. But then, the uniqueness of SHA256 (or of other hashes) is more a stastistical thing... There are too many possible hashes to be probable to find a collision. 20 bytes are still quite much. MD5 has 16 bytes, and was considered secure for many years. – xanatos Mar 17 '16 at 13:57
  • @ThorstenDittmar I'll link this: http://security.stackexchange.com/questions/72673/how-bad-is-it-to-truncate-a-hash The response seems to be "depends on how much you are truncating" – xanatos Mar 17 '16 at 14:04
  • Cryptographic hash algorithms make it hard to create a value that hashes to a specific hash. I don't believe this is an issue here - there is no statement of a security problem in the question. The asker is probably only interested in a good distribution and perhaps fast execution speed. – Martin Liversage Mar 17 '16 at 14:28
  • @MartinLiversage Yep... A [murmurhash](https://en.wikipedia.org/wiki/MurmurHash) would be perfect. Sadly in .NET the only hashes present are cryptographically strong hashes. – xanatos Mar 17 '16 at 14:32
  • @csetzkorn I feel like sh#t... I suggested something just because another high rep user had suggested it, and I felt his suggestion was very good... And I liked it, because it was tricky in a good way... But then I thought at it a little. Please read (or reread) my response from *If you prefer you can use good old*. I've update it (various times :-) ) – xanatos Mar 17 '16 at 15:01
2

A varbinary(20) is 160 bits so you are looking for a 160 bit hash algorithm. The SHA-1 algorithm produces a 160 bit hash value.

It seems that the purpose of your question is to create a hash value that is expected to be unique for a given instance of SomeClass so you should favor fast hash algorithms over cryptographic hash algorithms. SHA-1 is a cryptographic algorithm but it is pretty fast and there is an implementation in the .NET Framework. Also, there exists attacks on the SHA-1 algorithm so you should not use it for cryptographic purposes but instead chose algorithms like SHA-256 (that are slower).

All in all I believe that SHA-1 is a good fit for your problem. It is simple to use the algorithm. 1) Concatenate the strings, 2) convert them to upper case, 3) convert them to bytes using a suitable encoding (I use UTF-8) and 4) compute the hash:

Byte[] GetHash(SomeClass someClass) {
  if (someClass == null)
    throw new ArgumentNullException("someClass");

  var byteBuffers = GetStrings(someClass).Select(
    s => String.IsNullOrEmpty(s)
         ? new Byte[0] : Encoding.UTF8.GetBytes(s.ToUpperInvariant())
  );
  var bytes = byteBuffers
    .Aggregate(new List<Byte>(), (l, b) => { l.AddRange(b); return l; })
    .ToArray();
  using (var sha1 = new SHA1Managed())
    return sha1.ComputeHash(bytes);
}

IEnumerable<String> GetStrings(SomeClass someClass) {
  yield return someClass.Str1;
  yield return someClass.Str2;
  yield return someClass.Str3;
  yield return someClass.Str4;
}

Note that any hash algorithm (also cryptographic algorithms) can and will produce collisions.

Xanatos has a very good point:

I prepend the length of the various strings to the strings. In this way { "ABCD", "", "", "" } will produce a different hash than { "A", "B", "C", "D" }.

Here is an alternative solution that solves the same problem in a slightly different way where each string length modulo 256 is included in the hash:

Byte[] GetHash(SomeClass someClass) {
  if (someClass == null)
    throw new ArgumentNullException("someClass");

  var byteBuffers = GetBuffers(GetStrings(someClass));
  var bytes = byteBuffers
    .Aggregate(new List<Byte>(), (l, b) => { l.AddRange(b); return l; })
    .ToArray();
  using (var sha1 = new SHA1Managed())
    return sha1.ComputeHash(bytes);
}

IEnumerable<String> GetStrings(SomeClass someClass) {
  yield return someClass.Str1?.ToUpperInvariant();
  yield return someClass.Str2?.ToUpperInvariant();
  yield return someClass.Str3?.ToUpperInvariant();
  yield return someClass.Str4?.ToUpperInvariant();
}

IEnumerable<Byte[]> GetBuffers(IEnumerable<String> strings) {
  foreach (var @string in strings) {
    if (!String.IsNullOrEmpty(@string)) {
      yield return new[] { (Byte) (@string.Length%256) };
      yield return Encoding.UTF8.GetBytes(@string);
    }
    else
      yield return new Byte[1];
  }
}
Martin Liversage
  • 104,481
  • 22
  • 209
  • 256