Folding case to speed up comparisons

Question

"strasse".Equals("STRAße",StringComparison.InvariantCultureIgnoreCase)

This returns true, which is correct. Unfortunately, when I store one of these in postgres, it treats them as different when doing a case-insensitive match (for example, with ~*). I've also tested with citext, with the same result.

So one solution would be to pre-fold the case, thus storing strasse for either of these values, in another column. I could then index and search on that for matches.

I've been looking for a way to fold case in C# for a while and haven't been able to find one. Obviously the knowledge is there, because the framework can compare these strings properly; I just can't find where to get at it.

One solution would be to spawn a perl process, `perl -E "binmode STDOUT, ':utf8'; binmode STDIN, ':utf8'; while (<>) { print fc }"`, set the C# side of the process to utf8 for those pipes as well, and just send the text through perl to fold the case. But there has to be a better way than that.

Library [UnidecodeSharp](http://unidecode.codeplex.com/) could be helpful for this. — Ňuf, Apr 05 '18 at 21:02
Ah the good old curse of different implementation of collation :-) — xanatos, Dec 24 '20 at 22:50
What about ```string.Equals(str1,str2,StringComparison.CurrentCulture)``` ? — codebender, Dec 27 '20 at 17:31
@codebender how does that help do this case-insensitive comparison _in postgres_ ? — Tanktalus, Dec 28 '20 at 02:25

Tamir Daniely · Answer 1 · 2021-01-03T08:22:20.190

Looking through the sources I eventually found that most of this implementation is in a set of classes called CompareInfo.

You can find these at github.com/dotnet/runtime

That led me to a page that gives some insight into the inner workings of the .NET culture support: ".NET globalization and ICU".

It seems that dotnet is actually relying completely on native libraries for everything except ordinal operations.

From this I would assume that the .NET Framework is probably using NLS from Win32. For that, there is the FoldStringW function, which looks promising.

For ICU, there is documentation on Case Mappings, and I found the u_strFoldCase function.
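As a stopgap that avoids binding to a native library, the effect of folding can be roughly approximated in managed code. The sketch below is not ICU's u_strFoldCase and is not a complete Unicode case folding: it assumes a small, hand-picked table of expansions (the class name `ApproxCaseFold` and the table are hypothetical, with entries taken from the "F" rows of Unicode's CaseFolding.txt) and uses `ToLowerInvariant` as a stand-in for simple folding, which is only an approximation.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative only: a deliberately incomplete approximation of case folding.
public static class ApproxCaseFold
{
    // A few full-folding expansions (status "F" in CaseFolding.txt).
    // Hypothetical table; a real solution needs the complete data set.
    private static readonly Dictionary<char, string> Expansions =
        new Dictionary<char, string>
        {
            ['\u00DF'] = "ss", // LATIN SMALL LETTER SHARP S
            ['\u1E9E'] = "ss", // LATIN CAPITAL LETTER SHARP S
            ['\uFB00'] = "ff", // LATIN SMALL LIGATURE FF
            ['\uFB01'] = "fi", // LATIN SMALL LIGATURE FI
            ['\uFB02'] = "fl", // LATIN SMALL LIGATURE FL
        };

    // Returns a key suitable for storing in a separate, indexed column.
    public static string Fold(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Expand characters whose folded form is longer than one char.
            if (Expansions.TryGetValue(c, out var expanded))
                sb.Append(expanded);
            else
                sb.Append(c);
        }
        // Culture-independent lower-casing stands in for simple case folding.
        return sb.ToString().ToLowerInvariant();
    }
}
```

With this, `ApproxCaseFold.Fold("STRAße")` and `ApproxCaseFold.Fold("strasse")` produce the same key, which could then be stored and compared ordinally on the database side. Characters outside the table are only lower-cased, so this is no substitute for a real fold.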

Charlieface · Answer 2 · 2020-12-31T04:25:48.900

There is string.Normalize(), which takes a NormalizationForm parameter. Michael Kaplan goes into detail on this. He claims it does a better job than FoldStringW.

It does not, however, normalize case to upper or lower; it only folds to the canonical form. I would suggest applying ToUpper or ToLower afterwards.
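A minimal sketch for trying this suggestion against the strings from the question. The class and method names (`NormalizeThenUpper`, `Key`, `SameKey`) are hypothetical, and the choice of `NormalizationForm.FormKC` and `ToUpperInvariant` is an assumption, since the answer does not say which form or which overload to use.

```csharp
using System;
using System.Text;

// Illustrative helper for the "normalize, then change case" suggestion.
public static class NormalizeThenUpper
{
    // Builds a comparison key: Unicode normalization followed by
    // culture-independent upper-casing.
    public static string Key(string input) =>
        input.Normalize(NormalizationForm.FormKC).ToUpperInvariant();

    // Compares two strings by their keys, code unit by code unit.
    public static bool SameKey(string a, string b) =>
        string.Equals(Key(a), Key(b), StringComparison.Ordinal);
}
```

Calling `NormalizeThenUpper.SameKey("strasse", "STRAße")` is the check to run. As the comments on this answer report, this combination did not unify the two strings on .NET Core 3.1.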

edited Dec 31 '20 at 04:25

answered Dec 29 '20 at 23:14

The entire original point was to normalise case, although normalising the rest of unicode with combining characters and all that would likely also play a role in matching stuff later. – Tanktalus Dec 31 '20 at 04:14
When I said "normalize case" I meant specifically to upper or lower case, rather than folding to canonical forms, which was also part of the question. – Charlieface Dec 31 '20 at 04:27
ToUpper / ToLower don't work for case-insensitive matches in all languages, that's part of the problem. – Tanktalus Dec 31 '20 at 14:23
Even after `string.Normalize`? – Charlieface Dec 31 '20 at 14:32
Did you try it with the above strings? I just pushed it through .NET Core 3.1, and, unsurprisingly, it doesn't do what is required. – Tanktalus Jan 02 '21 at 20:33
