2

I’m passing a name string and its SHA1 value into a database. The SHA value is used as an index for searches. After the implementation was done, we got the requirement to make searching the name case insensitive. We do need to take all languages into account (Chinese characters are a real use case).

I know about the Turkey Test. How can I transform my input string before hashing to be case insensitive? Ideally I’d like it to be equivalent of InvariantCultureIgnoreCase.

In other words, how do I make the output of this function case insensitive?

private byte[] ComputeHash(string s)
{
     byte[] data = System.Text.Encoding.Unicode.GetBytes(s ?? string.Empty);
     SHA1 sha = new SHA1CryptoServiceProvider();     // returns 160 bit value
     return sha.ComputeHash(data);
}

If SHA isn’t possible, I might be able to make String.GetHashCode() work, but I don’t see a way to make that case insensitive either.

I'm betting this isn't possible. If its not, what are some work arounds?

Community
  • 1
  • 1
ErnieL
  • 5,773
  • 1
  • 23
  • 27

3 Answers3

7

The existing answers suggesting to use ToLower(Invariant) are wrong: comparing strings after doing ToLower is not equal to doing a string.Compare(xxxIgnoreCase). See the accepted answer here: String comparison - strA.ToLower()==strB.ToLower() or strA.Equals(strB,StringComparisonType)? it breaks down for certain kinds of characters.

The solution is to create a so called SortKey for every string. Such a SortKey essentially is a byte-array with the property that equal bytes mean equal strings. (Also, SortKeys can be compared in a binary way yielding the exact same order that string.Compare yields. But we don't need that property here).

Summary: Use CompareInfo.GetSortKey(string).KeyData to get a hashable byte[]. (GetSortKey on MSDN) This works for all possible cultures. It also works for case-insensitivity.

So a case-insensitive hash for any given string (even with turkish i) can be obtained with:

var sortKeyBytes = CultureInfo.InvariantCulture.CompareInfo.GetSortKey(anyString,
    CompareOptions.IgnoreCase).KeyData;
int hashCode = HashByteArray(sortKeyBytes); //Need to provide this function.
...

We can't use GetHashCode() of byte[] as this method is not overridden for byte[] and therefore defaults to object.GetHashCode() which uses object identity and not value.

You can use the hash function from this answer. It's not good but it does the job.

Community
  • 1
  • 1
usr
  • 168,620
  • 35
  • 240
  • 369
  • this is interesting. The question I have is: How unique are the resulting SortKeys. Before hashing, you must have unique values. Since I don't know the answer, I will bow out. – Tergiver May 04 '12 at 16:54
  • Why would you need unique values before hashing? The resulting sort keys represent a string exactly. There are no wrong collisions. – usr May 04 '12 at 17:07
  • The purpose of producing a hash is to use that hash later to compare against something else. If you obliterate the uniqueness of the data prior to hashing, you have lost the information you were intending on producing. – Tergiver May 04 '12 at 17:15
  • Again, I'm not saying this is wrong (I gave you the only upvote so far). I don't know what exactly is produced by GetSortKey, so I can't dispute it. I would want to know before I used it and since this is not my problem, I'm not keen on expending the energy to test it. – Tergiver May 04 '12 at 17:17
  • SortKey takes the case into account, so this won't work as a case insensitive approach will it? – NominSim May 04 '12 at 18:32
  • 1
    You can specify what type of comparison the sort key should represent. The method is defined like this: SortKey GetSortKey(string source, CompareOptions options). The CompareOptions contain, among others, the option to ignore case. The SortKey system is designed to exactly represent everything that string.Equals and string.Compare do internally. No loss of information at all. – usr May 04 '12 at 19:12
  • 1
    I've suggested your response to another similar question (http://stackoverflow.com/questions/36061627/hashcode-varbinary20-c-sharp/36062012#36062012)... But now I'm thinking about it... What about stability? Unicode changes during time... The collations change during time... Couldn't they change the weights of the characters? – xanatos Mar 17 '16 at 13:40
  • @xanatos you're right, that's a problem. NTFS stores Unicode tables internally on volume creation to make sure it's b-trees stay consistent. SQL Server uses it's own collations, too. Not sure if Windows is capable of providing stable Unicode data. – usr Mar 17 '16 at 14:03
  • @xanatos MSDN [mentions versioning](https://learn.microsoft.com/en-us/windows/win32/intl/handling-sorting-in-your-applications#use-sort-versioning), and AFAICT it does indeed indicate that weights will change over time :( – Cocowalla May 14 '21 at 20:45
6

You could use s.ToUpperInvariant() prior to generating the hash. As long as you do it both ways (generating the original hash, and generating a hash to test against the original), it will work.

Tergiver
  • 14,171
  • 3
  • 41
  • 68
  • @usr Actually, it will work just fine. It further reduces the "strength" of the hash, but that should be obvious. – Tergiver May 04 '12 at 16:23
  • You probably don't want to use the Invariant version of ToLower, but a culture-specific overload. They stipulated Invariant in the question, so I went with it here. – Tergiver May 04 '12 at 16:25
  • +1. One may also need to String.Normalize both string the same way before ToLowerInvariant. – Alexei Levenkov May 04 '12 at 16:30
  • I'm unfamiliar with `String.Normalize`. Reading the MSDN docs on it leads me to want to ask a Unicode expert. You might ask another question, "Should I Normalize a string before hashing it?" A quick Google search didn't turn up anything helpful. – Tergiver May 04 '12 at 16:40
  • comparing strings after doing tolower is **not** equal to doing a string.Compare(xxxIgnoreCase). See the accepted answer here: http://stackoverflow.com/questions/1660192/string-comparison-stra-tolower-strb-tolower-or-stra-equalsstrb-stringcom it breaks down for certain kinds of characters. – usr May 04 '12 at 16:42
  • There's no other way to hash a string in a case-insensitive way. If that's what you need, you'll have to live with the (rare) possible failure case. – Tergiver May 04 '12 at 16:44
  • @usr btw, the MSDN doc linked in that accepted answer says, "DO: Use ToUpperInvariant rather than ToLowerInvariant when normalizing strings for comparison." So I should change the answer. – Tergiver May 04 '12 at 16:48
  • @Tergiver, there **is** a way. See my answer. – usr May 04 '12 at 16:51
2

To make something case insensitive, remove the case:

s = s.ToLowerInvariant();

Do not use CurrentCulture if you can't store it into database and use to convert other string for match like:

s = s.ToLower(System.Globalization.CultureInfo.CurrentCulture);

You may consider using another (non Invariant) culture all the time, but it could be surprise for future code maintainer (one normally expects either Current or Invariant culture for all string operations).

Alexei Levenkov
  • 98,904
  • 14
  • 127
  • 179
NominSim
  • 8,447
  • 3
  • 28
  • 38