Based on my understanding (see my other question), deciding whether to test string equality with ordinal or cultural rules depends on the semantics of the comparison being performed.

If the two compared strings must be treated as raw sequences of characters (in other words, as two symbols), then an ordinal string comparison must be performed. This is the case for most string comparisons performed in server-side code.

Example: performing a user lookup by username. In this case the usernames of the available users and the searched username are just symbols. They are not words in a specific language, so there is no need to take linguistic elements into account when comparing them. In this context two symbols composed of different characters must be considered different, regardless of any linguistic rule.
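A minimal sketch of such a lookup, assuming a hypothetical in-memory user store (the `UserLookup` class, its dictionary, and the sample usernames are made up for illustration). Passing `StringComparer.Ordinal` makes the dictionary treat keys as plain symbols:

```csharp
using System;
using System.Collections.Generic;

public static class UserLookup
{
    // Hypothetical user store keyed by username.
    // StringComparer.Ordinal compares UTF-16 code units and applies no linguistic rules.
    private static readonly Dictionary<string, int> UserIdsByName =
        new Dictionary<string, int>(StringComparer.Ordinal)
        {
            ["alice"] = 1,
            ["bob"] = 2,
        };

    // Returns true and the user id when the exact username exists.
    public static bool TryFindUserId(string username, out int userId) =>
        UserIdsByName.TryGetValue(username, out userId);

    public static void Main()
    {
        Console.WriteLine(TryFindUserId("alice", out _)); // True
        Console.WriteLine(TryFindUserId("Alice", out _)); // False: different characters, different symbol
    }
}
```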

If the two compared strings must be considered as words in a specific language, then cultural rules must be taken into account during the comparison. It is entirely possible that two strings composed of different characters are considered the same word in a certain language, based on the grammatical rules of that language.

Example: the two words strasse and straße both mean street in the German language. So, when comparing strings that represent German words, this grammatical rule must be taken into account and these two strings must be considered equal (think of an application for the German market where the user enters the name of a street and that street must be looked up in a database, in order to get the city where the street is located).
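A rough sketch of what a culture-aware comparison for this example could look like. The `StreetNameComparison` class and `CultureEquals` helper are hypothetical names, and the code assumes the `de-DE` culture data is available on the machine. The culture-aware result can differ between runtimes and globalization backends, so no expected output is shown for it:

```csharp
using System;
using System.Globalization;

public static class StreetNameComparison
{
    // Compares two strings using the linguistic rules of the named culture.
    public static bool CultureEquals(string a, string b, string cultureName)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        return string.Compare(a, b, culture, CompareOptions.None) == 0;
    }

    public static void Main()
    {
        // Ordinal: the strings contain different characters, so they are different.
        Console.WriteLine(string.Equals("strasse", "straße", StringComparison.Ordinal)); // False

        // Culture-aware: the outcome depends on the collation rules the runtime uses for de-DE.
        Console.WriteLine(CultureEquals("strasse", "straße", "de-DE"));
    }
}
```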

So far, so good.

Given all of this, in which cases does it make sense to use the .NET invariant culture for string equality?

The point is that the invariant culture (as opposed to the German culture mentioned in the example above) is a fake culture based on American English linguistic rules. Put another way, there is no human language whose rules are based on the .NET invariant culture, so why should I compare two strings by using this fictitious culture?

I know that the invariant culture is typically used to format and parse strings in machine-to-machine communication scenarios (such as the contracts exposed by a web API).

I would like to understand when calling string.Equals with StringComparison.InvariantCulture, as opposed to StringComparison.CurrentCulture (for some manually set thread culture, so as not to depend on the machine's OS configuration), really makes sense.

Enrico Massone
  • Whether strasse and straße are equal, is a function of the domain, not .net or even C#. There are cases where you want to evaluate `(strasse == straße) == true` and also times when `(strasse == straße) == false`. Your business logic should decide how you compare strings... – Austin T French May 11 '20 at 22:14
  • @AustinTFrench totally agree with you. This is the rationale to be used when choosing between ordinal string comparison and culture-aware string comparison. My question is whether using the invariant culture, as opposed to a specific culture (en-gb, fr-fr, etc.), really makes sense for culture-aware string comparison. – Enrico Massone May 11 '20 at 22:21
  • InvariantCulture is a simple answer to the question "if everybody does it differently, then what's the standard?" You may well like it if you have, say, a config file that specifies default values for a floating point number that the user can change. Since you can never guess right at using a comma or decimal point for that user when you deploy that file, you have to pick a standard. Convenient. Make sure it is obvious to the user when they change it, use '.' even if you don't need it. – Hans Passant May 11 '20 at 22:59
  • Consider the case where you have a field that represents the *Name* of something known to the program, but not exposed in the UI. The name will be invariant, not something you will localize – Flydog57 May 11 '20 at 23:44

1 Answer

Combining diacritics / non-normalised strings is one example. See this answer for a decent treatment with code: https://stackoverflow.com/a/31361980/2701753

In summary, for (many) 'alphabets' there are several potential Unicode (and UCS-2) representations for the same glyph (letter).

For example:

Unicode Character “á” (U+00E1) [one unicode codepoint]
Unicode Character “a” (U+0061) [followed by] Unicode Character “◌́” (U+0301) [two unicode codepoints]

so:
á
á

These are the same linguistic string (in all cultures they are supposed to represent the same character) but different ordinal strings (different bytes).

So an invariant-culture equality comparison is [in this case] like normalising the strings before comparing them.

Look up Unicode normalisation / decomposition for more info.
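A small sketch of the á example above, comparing the two representations in three ways. The `AccentComparison` class and its method names are hypothetical. The invariant-culture result assumes the runtime has full globalization data; in invariant globalization mode, culture-aware comparisons fall back to ordinal behaviour:

```csharp
using System;
using System.Text;

public static class AccentComparison
{
    public const string Precomposed = "\u00E1"; // á as a single code point
    public const string Decomposed = "a\u0301"; // a followed by a combining acute accent

    // Compares the raw UTF-16 code units.
    public static bool OrdinalEquals(string a, string b) =>
        string.Equals(a, b, StringComparison.Ordinal);

    // Compares using the linguistic rules of the invariant culture.
    public static bool InvariantEquals(string a, string b) =>
        string.Equals(a, b, StringComparison.InvariantCulture);

    // Normalises both strings to form C first, then compares ordinally.
    public static bool NormalizedOrdinalEquals(string a, string b) =>
        string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);

    public static void Main()
    {
        Console.WriteLine(OrdinalEquals(Precomposed, Decomposed));           // False
        Console.WriteLine(NormalizedOrdinalEquals(Precomposed, Decomposed)); // True
        // Expected to be True when full globalization data is available.
        Console.WriteLine(InvariantEquals(Precomposed, Decomposed));
    }
}
```

Explicit normalisation followed by an ordinal comparison only covers canonical equivalence; a linguistic comparison may also apply other rules, so the two are not interchangeable in general.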

There are other interesting cases: ligatures, for example, and left-to-right and right-to-left marks, and so on.

So, in summary, once you have 'interesting' alphabets in play (pretty much anything outside pure ASCII) and you are interested in any sort of comparison of the strings as linguistic items / streams of glyphs, you probably do want to go beyond ordinal comparison.

To directly answer the question: If you have a multicultural user-base, but still need the above linguistic sensitivity, what culture would you choose for:

StringComparison.CurrentCulture (for some manually set thread culture, in order to not depend on the machine OS configurations)

other than InvariantCulture?

tolanj
  • Of course you might need to roll your own, does = A?? Do various whitespaces match up [there are a lot of them] – tolanj May 11 '20 at 22:35
  • So, to put it briefly, the invariant culture must be used whenever linguistic sensitivity is needed in string comparisons but a specific language cannot be determined, because it is not possible to assume that all the users share the same language. So it's a kind of conventional choice for the culture to be used when there is ambiguity. – Enrico Massone May 11 '20 at 23:08