EqualityComparer with stable HashCode and option to ignore diacritics?

Question

I'm looking for an IEqualityComparer<String> that supports a stable HashCode, i.e. the same HashCode between executions/processes. I also need it to ignore casing and nonspacing combining characters (such as diacritics).

Are there any "easy" ways of accomplishing this in .NET? I have started on my custom implementation with a stable HashCode that ignores casing but I'm beginning to wish I could use the already existing implementations in .NET somehow.

The built-in string comparer adds some random seed to HashCodes between processes, so they are not stable (I think because they cannot guarantee it will remain stable between .NET runtimes?), but I think I can handle that by just making sure the HashCodes I persist get wiped/rebuilt when moving to another runtime.

In any case, is there any way to access the inner checksum calculation (without the randomness)? Perhaps with reflection?

Update: I'm not an expert on the why, but it's evident that the HashCode is calculated differently between runs. I need it because I have a disk based lookup index that is using the hashcode for strings as keys and since it is persistent I obviously need them to be the same between runs. I could calculate my own checksums in any way I like of course, but since .NET already does a very good job with this I wish I could take advantage of that, just without the "seed" or whatever you want to call it, the thing that makes the hashcodes different between runs.
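
To make the requirement concrete, here is a minimal sketch of the kind of comparer I have in mind. It is not how .NET computes string hashes internally. The class name `StableFoldingComparer` and the helper `Fold` are hypothetical, and FNV-1a is just a simple hash I picked for illustration. It assumes that Unicode normalization and invariant upper-casing give the same results on every runtime that reads the persisted index, and it does not handle combining marks outside the Basic Multilingual Plane.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Sketch only: a comparer whose hash does not depend on a per-process seed
// and which ignores case and nonspacing combining marks.
public sealed class StableFoldingComparer : IEqualityComparer<string>
{
    // Hypothetical helper: reduce a string to the form that is compared and hashed.
    private static string Fold(string s)
    {
        var sb = new StringBuilder(s.Length);
        // FormD decomposes e.g. "é" into "e" followed by U+0301 (combining acute).
        foreach (char c in s.Normalize(NormalizationForm.FormD))
        {
            // Drop nonspacing marks, upper-case everything else.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(char.ToUpperInvariant(c));
        }
        return sb.ToString();
    }

    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return string.Equals(Fold(x), Fold(y), StringComparison.Ordinal);
    }

    public int GetHashCode(string s)
    {
        if (s is null) return 0;
        unchecked
        {
            // 32-bit FNV-1a over the two bytes of each UTF-16 code unit.
            uint hash = 2166136261;
            foreach (char c in Fold(s))
            {
                hash ^= (byte)(c & 0xFF);
                hash *= 16777619;
                hash ^= (byte)(c >> 8);
                hash *= 16777619;
            }
            return (int)hash;
        }
    }
}
```

With this, `new StableFoldingComparer().Equals("Résumé", "RESUME")` is true and both strings get the same hash code, because both fold to "RESUME" before hashing. Folding on every call allocates, so a real implementation would probably hash while folding instead of building an intermediate string.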

`adds some random seed to HashCodes` no. The hash code is generated by the *objects*, not the comparer. `String` doesn't What are you really asking and why? — Panagiotis Kanavos, Feb 27 '19 at 09:00
Possible duplicate of [Persistent hashcode for strings](https://stackoverflow.com/questions/36845430/persistent-hashcode-for-strings) — mjwills, Feb 27 '19 at 09:00
Check [String.GetHashCode](https://referencesource.microsoft.com/#mscorlib/system/string.cs,833) to see how the hash code is actually calculated. Randomized string hashes are controlled by a [config file switch](https://learn.microsoft.com/en-us/dotnet/framework/configure-apps/file-schema/runtime/userandomizedstringhashalgorithm-element). The default though is *false*, which means there's no randomization — Panagiotis Kanavos, Feb 27 '19 at 09:03
`.NET Core` indeed uses a random element when generating a hash code for strings; this is done to mitigate some hash based DoS attacks. Some details: https://andrewlock.net/why-is-string-gethashcode-different-each-time-i-run-my-program-in-net-core/. But I have to agree with the other comments: you usually don't want to keep hash codes or care about their values — Matan Shahar, Feb 27 '19 at 09:03
@AndreasZita `I have a disk based lookup index that is using the hashcode for strings as keys and since it is persistent` databases already use such techniques when indexing text. They're actually *faster* than simple hashes. Why not use a lookup table with adequate indexing? — Panagiotis Kanavos, Feb 27 '19 at 09:08
@AndreasZita what are you *actually* trying to do? No StringComparer will ever use the database or lookup table. Its job is to compare two strings. The typical way is to compare their hashes, and if they match, compare their values. They don't need a *stable* hash code for this. — Panagiotis Kanavos, Feb 27 '19 at 09:11
`I need it because I have a disk based lookup index that is using the hashcode for strings as keys and since it is persistent I obviously need them to be the same between runtime. I could calculate my own checksums in any way I like of course but since .NET already do a very good job with this I wish I could take advantage of that.` Read the duplicate I provided, and its duplicate. You will need to build this yourself. The duplicate, and its duplicate, will show you ways to build it. — mjwills, Feb 27 '19 at 09:15
@AndreasZita hash storage would make sense if you wanted to implement string *searching*, not hashing. Hashes would be useful if your search class wanted to compare a string's hash against a pre-calculated table of hashes. In that case you should explicitly use a hash algorithm appropriate for text comparisons. — Panagiotis Kanavos, Feb 27 '19 at 09:15
I have a kind of Dictionary that only ever is on disk (all unique keys, duplicate entries and values) which compares keys and then values etc, just like a normal dictionary, but on disk. So I need a stable checksum for the keys. What's wrong with that? — Andreas Zita, Feb 27 '19 at 09:15
@AndreasZita that you're using the wrong terms, asking the wrong questions. You don't need a custom StringComparer for this. You don't have anything to compare. You want to *find* one string's hash in a list of hashes. — Panagiotis Kanavos, Feb 27 '19 at 09:16
Sorry, the StringComparer in .NET implements IEqualityComparer as well, that is what I was asking for really. I will read the duplicates and come back here later. Perhaps the config UseRandomizedStringHashAlgorithm is just what I need. — Andreas Zita, Feb 27 '19 at 09:18
@AndreasZita no it isn't. You keep asking the wrong questions. How is a *comparer* going to be used with a *list of hashes*? Its job is to compare *two strings*, not look up anything — Panagiotis Kanavos, Feb 27 '19 at 09:19
@AndreasZita for one thing, any hash algorithm you intend to use is probably *wrong* for text. Will you treat uppercase and lowercase letters as the same or different values? Accents? There are text hashing algorithms that take care of such things and are a lot faster than generic or cryptographic hashes. How *are* you going to use that hash index anyway? — Panagiotis Kanavos, Feb 27 '19 at 09:21
@AndreasZita besides, hash tables aren't the fastest option when it comes to string searching. There are things like e.g. tries/prefix trees that take a lot less space and are a lot faster at finding matches — Panagiotis Kanavos, Feb 27 '19 at 09:22

0 Answers