
I ran into a compatibility problem while upgrading my projects from .NET Core 3.1 to .NET 5.

My original code validates a file name by checking it against each character returned by the Path.GetInvalidFileNameChars() API.


var invalidFilenameChars = Path.GetInvalidFileNameChars();
bool validFileName = !invalidFilenameChars.Any(ch => fileName.Contains(ch, StringComparison.InvariantCulture));

Suppose fileName holds a regular value such as "test.txt", which should be valid. Surprisingly, the code above reports the file name as invalid when it runs with the 'net5' target framework.

After spending some time debugging, I found that the returned set of invalid characters contains '\0', the ASCII null character, and that "test.txt".Contains("\0", StringComparison.InvariantCulture) returns true.

    using System;

    class Program
    {
        static void Main(string[] args)
        {
            // Culture-sensitive search for the NUL character in a string that has none.
            var containsNullChar = "test".Contains("\0", StringComparison.InvariantCulture);

            Console.WriteLine($"Contains null char {containsNullChar}");
        }
    }

On .NET Core 3.1, this never reports that a regular string contains the null character. Likewise, if I omit the second parameter (StringComparison.InvariantCulture) or use StringComparison.Ordinal, the strange result does not occur.
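Since the ordinal path behaves as expected, the validation could be rewritten so that no culture-sensitive comparison is involved at all. The following is only a minimal sketch: the FileNameCheck class and IsValidFileName method are hypothetical names, and the helper assumes fileName is not null.

```csharp
using System;
using System.IO;

static class FileNameCheck
{
    // Hypothetical helper: true when fileName contains none of the characters
    // reported by Path.GetInvalidFileNameChars(). Assumes fileName is not null.
    public static bool IsValidFileName(string fileName)
    {
        // IndexOfAny matches UTF-16 code units one by one (ordinal semantics),
        // so culture/ICU collation rules never come into play.
        return fileName.IndexOfAny(Path.GetInvalidFileNameChars()) < 0;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(FileNameCheck.IsValidFileName("test.txt"));   // True
        Console.WriteLine(FileNameCheck.IsValidFileName("te\0st.txt")); // False
    }
}
```

Because string.IndexOfAny(char[]) takes no StringComparison argument, there is no way to select culture-aware matching by accident.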

Why did this behavior change in .NET 5?

EDIT: As Karl-Johan Sjögren commented, there is indeed a behavior change in .NET 5 regarding string comparison:

Behavior changes when comparing strings on .NET 5+

Also see the related ticket:

string.IndexOf get different result in .Net 5

Although this issue is related to the above, the result for '\0' still looks strange to me, and it might still be considered a bug, as answered by @xanatos.

EDIT2:

I now realize that the actual cause of this problem was my confusion between InvariantCulture and Ordinal string comparison. They are quite different things. See the question below:

Difference between InvariantCulture and Ordinal string comparison

Also note that this appears to be a problem specific to .NET, since other major programming languages such as Java, C++ and Python use ordinal comparison by default.

Shayan Shafiq
Ryo Asai
  • It was probably a bug in the first place. All strings are internally stored as WCHAR* and therefore null-terminated, probably it's an off-by-one error – Charlieface Jan 05 '21 at 07:49
  • Yes it was changed in .Net 5. https://learn.microsoft.com/en-us/dotnet/standard/base-types/string-comparison-net-5-plus – Karl-Johan Sjögren Jan 05 '21 at 07:56
  • @Charlieface Though the NUL character is not used in C# to detect the string end (as "a\0b".Length returns 3) and mostly present to make interop easier. – ckuri Jan 05 '21 at 07:59
  • The problem is even bigger... `"test".IndexOf("\0", StringComparison.InvariantCulture) == 0` instead of being -1 or 4 – xanatos Jan 05 '21 at 08:29
  • @ryo do you want to open the bug? Otherwise I'll open it for both Contains and IndexOf (and probably other methods) – xanatos Jan 05 '21 at 08:30
  • I wonder where I can file a bug. @xanatos Can you please file a bug? – Ryo Asai Jan 05 '21 at 09:15
  • @RyoAsai aaaaand done :-) – xanatos Jan 05 '21 at 09:27
  • Does this answer your question? [string.IndexOf get different result in .Net 5](https://stackoverflow.com/questions/64833645/string-indexof-get-different-result-in-net-5) – Peter B Jan 05 '21 at 09:31
  • @xanatos Thank you for your quick action! Anyway, for now it would be a safer option to use StringComparison.Ordinal instead of StringComparison.InvariantCulture. – Ryo Asai Jan 05 '21 at 09:32
  • @RyoAsai I've received a reply. I've expanded the response and added some thought of mine. – xanatos Jan 06 '21 at 09:30
  • @xanatos Thank you so much. The actual problem was that I confused InvariantCulture with Ordinal in string comparison. Now I know they are quite different! Anyway, I now know the ICU string problem is very interesting. – Ryo Asai Jan 06 '21 at 14:51
  • @RyoAsai It wasn't a confusion of yours. It was a common misconception about what the `CultureInvariant` rules were. Sadly I'm in your camp – xanatos Jan 06 '21 at 15:15

1 Answer


not a bug, a feature

The issue that I opened has been closed, but they gave a very good explanation. In .NET 5.0 they began using a new library for comparing strings on Windows (on Linux it was already in use): the ICU library. It is the official library of the Unicode Consortium, so it is "the gospel". That library is used for CurrentCulture, InvariantCulture (plus the respective IgnoreCase variants) and any other culture. The only exception is Ordinal/OrdinalIgnoreCase.

The library is targeted at text, and it has some "particular" ideas about non-text. In this particular case, there are some characters that are simply ignored. In the block 0000-00FF I would say the ignored characters are all control codes (please ignore the fact that they are shown as €‚ƒ„†‡ˆ‰Š‹ŒŽ‘’“”•–—™š›œžŸ; at some point these characters were remapped elsewhere in Unicode and the glyphs shown don't reflect it, but if you inspect their code, for example with char ch = '€'; int val = (int)ch; you'll see it), and '\0' is a control code.

Now... My personal view is that from now on you'll need a master's degree in Unicode Technologies to compare strings, and I do hope that they'll do some shenanigans in .NET 6.0 to make the default comparison Ordinal (it is one of the proposals for .NET 6.0, Option B). Note that if you wanted to write programs that can run in Turkey, you already needed a master's degree in Unicode Technologies (see the Turkish i problem).

In general I would say that to look for words that aren't keywords/fixed words, you should use culture-aware comparisons, while to look for keywords/fixed words (for example column names) and symbols/control codes you should use Ordinal comparisons. The problem arises when you want to look for both at the same time. Normally in that case you are looking for exact words, so you can use Ordinal. Otherwise it becomes hellish. And I don't even want to think about how Regex works internally in a culture-aware environment, because in that direction there can only be folly and nightmares.
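The split described above might look like the following in practice. This is only a rough sketch: ComparisonChoice, IsKnownColumn, ContainsText and the column names are hypothetical, chosen purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class ComparisonChoice
{
    // Hypothetical fixed identifiers. OrdinalIgnoreCase bypasses culture rules entirely.
    static readonly HashSet<string> KnownColumns =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Id", "Title", "CreatedAt" };

    public static bool IsKnownColumn(string name) => KnownColumns.Contains(name);

    // Free-form text typed by a person: culture-aware search through CompareInfo.
    public static bool ContainsText(string text, string term, CultureInfo culture) =>
        culture.CompareInfo.IndexOf(text, term, CompareOptions.IgnoreCase) >= 0;
}

class Program
{
    static void Main()
    {
        Console.WriteLine(ComparisonChoice.IsKnownColumn("TITLE"));   // True
        Console.WriteLine(ComparisonChoice.IsKnownColumn("Ti\0tle")); // False: ordinal never skips characters
        // The result of a culture-aware search depends on the collation rules in use.
        Console.WriteLine(ComparisonChoice.ContainsText("Straße in Berlin", "berlin", CultureInfo.InvariantCulture));
    }
}
```

Keeping the comparison choice inside a comparer or a small helper means each call site cannot silently fall back to a different default.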

As a side note, even before this change the "default" culture-aware comparisons had some secret shenanigans. For example:

int ix = "ʹ$ʹ".IndexOf("$"); // -1 on .NET Framework or .NET Core <= 3.1
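For contrast with the culture-aware call above, an ordinal search treats both the symbol and the null character as plain UTF-16 code units. A small sketch follows, in which OrdinalSearch.Find is a hypothetical wrapper added only for illustration:

```csharp
using System;

static class OrdinalSearch
{
    // Hypothetical wrapper: compares UTF-16 code units only, with no collation rules.
    public static int Find(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);
}

class Program
{
    static void Main()
    {
        Console.WriteLine(OrdinalSearch.Find("ʹ$ʹ", "$"));   // 1
        Console.WriteLine(OrdinalSearch.Find("test", "\0")); // -1
    }
}
```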

what I had written before

I'd say that it is a bug. There is a similar bug with IndexOf. I've opened an issue on GitHub to track it.

As you wrote, Ordinal and OrdinalIgnoreCase work as expected (probably because they don't need to use the new ICU library for handling Unicode).

Some sample code:

Console.WriteLine($"Ordinal Contains null char {"test".Contains("\0", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase Contains null char {"test".Contains("\0", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture Contains null char {"test".Contains("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture Contains null char {"test".Contains("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.InvariantCultureIgnoreCase)}");

Console.WriteLine($"Ordinal IndexOf null char {"test".IndexOf("\0", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCultureIgnoreCase)}");

and

Console.WriteLine($"Ordinal Contains null char {"test".Contains("\0test", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture Contains null char {"test".Contains("\0test", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture Contains null char {"test".Contains("\0test", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.InvariantCultureIgnoreCase)}");

Console.WriteLine($"Ordinal IndexOf null char {"test".IndexOf("\0test", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture IndexOf null char {"test".IndexOf("\0test", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture IndexOf null char {"test".IndexOf("\0test", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.InvariantCultureIgnoreCase)}");
xanatos