Simple example:
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text; // only needed for the commented-out Encoding.UTF8 variant

public class SomeClass
{
    public string Str1 { get; set; }
    public string Str2 { get; set; }
    public string Str3 { get; set; }
    public string Str4 { get; set; }

    // Computes a case-insensitive SHA-256 hash over the four strings.
    public byte[] SHA256()
    {
        using (var sha256 = new SHA256Managed())
        {
            var strings = new[] { Str1, Str2, Str3, Str4 };

            for (int i = 0; i < strings.Length; i++)
            {
                string str = strings[i];

                if (str != null)
                {
                    // Commented lines are for using ToUpperInvariant()
                    //str = str.ToUpperInvariant();

                    // Prepend the string length so that field boundaries affect the hash
                    byte[] length2 = BitConverter.GetBytes(str.Length);
                    sha256.TransformBlock(length2, 0, length2.Length, length2, 0);

                    //byte[] sortKeyBytes = Encoding.UTF8.GetBytes(str);
                    byte[] sortKeyBytes = CultureInfo.InvariantCulture.CompareInfo.GetSortKey(str, CompareOptions.IgnoreCase).KeyData;
                    sha256.TransformBlock(sortKeyBytes, 0, sortKeyBytes.Length, sortKeyBytes, 0);
                }
                else
                {
                    // A null string is hashed as length -1, which keeps it distinct from ""
                    byte[] length2 = BitConverter.GetBytes(-1);
                    sha256.TransformBlock(length2, 0, length2.Length, length2, 0);
                }
            }

            // Finalize the hash; no further input is appended here
            sha256.TransformFinalBlock(new byte[0], 0, 0);
            byte[] hash = sha256.Hash;
            return hash;
        }
    }
}
I'm using SHA256, and the solution is based on the one suggested by @usr in https://stackoverflow.com/a/10452967/613130. The generated hash is 32 bytes long, but you can truncate it to 20 bytes (clearly this makes collisions more likely).
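If you do want the shorter hash, truncation is just a matter of keeping the leading bytes. A minimal sketch (the HashTruncation class and its Truncate method are names I made up for illustration, they are not part of the framework):

```csharp
using System;

public static class HashTruncation
{
    // Returns a copy of the first `length` bytes of `hash`.
    public static byte[] Truncate(byte[] hash, int length)
    {
        if (hash == null)
            throw new ArgumentNullException(nameof(hash));
        if (length < 0 || length > hash.Length)
            throw new ArgumentOutOfRangeException(nameof(length));

        var truncated = new byte[length];
        Array.Copy(hash, truncated, length);
        return truncated;
    }
}
```

With the class above you would call it as HashTruncation.Truncate(someObject.SHA256(), 20).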
I prepend the length of each string to its bytes (a null string is written as length -1). This way { "ABCD", "", "", "" } will produce a different hash than { "A", "B", "C", "D" }.
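To see why the length prefix matters, here is a small sketch that hashes both arrays with and without it. LengthPrefixDemo, Hash and Collides are hypothetical names, and for brevity it assumes non-null strings and plain UTF-8 bytes instead of sort keys:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class LengthPrefixDemo
{
    // Hashes the concatenated strings, optionally prefixing each with its length.
    // Assumes every element of `parts` is non-null.
    public static byte[] Hash(string[] parts, bool prefixLength)
    {
        using (var buffer = new MemoryStream())
        using (var sha256 = SHA256.Create())
        {
            foreach (string part in parts)
            {
                if (prefixLength)
                {
                    byte[] length = BitConverter.GetBytes(part.Length);
                    buffer.Write(length, 0, length.Length);
                }

                byte[] bytes = Encoding.UTF8.GetBytes(part);
                buffer.Write(bytes, 0, bytes.Length);
            }

            return sha256.ComputeHash(buffer.ToArray());
        }
    }

    // True if the two example arrays from the answer produce the same hash.
    public static bool Collides(bool prefixLength)
    {
        var a = new[] { "ABCD", "", "", "" };
        var b = new[] { "A", "B", "C", "D" };
        return Hash(a, prefixLength).SequenceEqual(Hash(b, prefixLength));
    }
}
```

Without the prefix both arrays boil down to the same byte sequence for "ABCD", so Collides(false) is true. With the prefix the byte sequences differ, so the hashes differ too.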
If you prefer, you can use good old ToUpperInvariant() and hash based on it (there are some commented lines in the code: uncomment them, remove the line starting with byte[] sortKeyBytes = CultureInfo.InvariantCulture, and live happily :-) ).
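For reference, this is roughly what the ToUpperInvariant() variant looks like when written out on its own. It is only a sketch: CaseInsensitiveHash and Compute are names I picked for the example, and I'm assuming UTF-8 for turning the upper-cased string into bytes, as in the commented line:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class CaseInsensitiveHash
{
    // Hashes the strings case-insensitively by upper-casing them first.
    public static byte[] Compute(params string[] strings)
    {
        using (var sha256 = SHA256.Create())
        {
            foreach (string original in strings)
            {
                if (original == null)
                {
                    // Same convention as above: null is written as length -1
                    byte[] marker = BitConverter.GetBytes(-1);
                    sha256.TransformBlock(marker, 0, marker.Length, marker, 0);
                    continue;
                }

                string upper = original.ToUpperInvariant();

                // Length prefix first, then the UTF-8 bytes of the upper-cased string
                byte[] length = BitConverter.GetBytes(upper.Length);
                sha256.TransformBlock(length, 0, length.Length, length, 0);

                byte[] bytes = Encoding.UTF8.GetBytes(upper);
                sha256.TransformBlock(bytes, 0, bytes.Length, bytes, 0);
            }

            sha256.TransformFinalBlock(new byte[0], 0, 0);
            return sha256.Hash;
        }
    }
}
```

Here CaseInsensitiveHash.Compute("abc", "x") and CaseInsensitiveHash.Compute("ABC", "X") give the same 32 bytes, while a null and an empty string in the same position give different ones.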
To tell the truth, I'm not sure about the "stability" of GetSortKey... Will GetSortKey return the same weights in 5 years, in .NET 10.0 with Unicode 11.0? Who knows? I surely don't!
MSDN suggests that they could change:
If an application serializes a SortKey object, the application must regenerate all the sort keys when there is a new version of the .NET Framework.
So in the end I suggest the alternative solution based on .ToUpperInvariant() (to be clear, if my boss asked me to do it, I would tell him: use .ToUpperInvariant()). Note that even with .ToUpperInvariant() there could be small changes in the future: new upper case characters could be introduced for existing lower case characters. See http://unicode.org/faq/casemap_charprop.html, "Can a case pair be added if one of the pair is already encoded?"