
So, I saw this question here on Stack Overflow (the question), and it says:

Update 2: This note on Roslyn confirms that the underlying platform defines the Unicode support for the compiler, and in the link to the code it explains that C# 6.0 supports Unicode 6.0 and up (with a breaking change for C# identifiers as a result).

So I am now wondering if I can, for example, read a file that contains Unicode 13.0 characters, or am I missing something?

Uwe Keim
jeems
    Sounds to me that this means the compiler can handle _source files_ from Unicode 6.0 and newer. Doesn't sound to me that this means a runtime limitation at all. – Uwe Keim Aug 10 '20 at 05:45
    @UweKeim love the pic :) –  Aug 10 '20 at 05:49
  • Thanks, @MickyD. Practice what you preach. – Uwe Keim Aug 10 '20 at 05:53
    That thread discussed a very specific part of .NET Framework Unicode support, so unless you ask a similar one, it would be too broad to discuss. If the scope is just whether C# compiler can "read a file that contains Unicode 13.0 characters", edit the question to be that specific. – Lex Li Aug 10 '20 at 05:54

1 Answer


There are three things at play here:

  • The compiler, which is only relevant for source file handling. If you try to compile code that includes characters the compiler is unaware of, I would expect the compiler to treat those characters as "unknown" in terms of their Unicode category. (So you wouldn't be able to use them in identifiers, they wouldn't count as whitespace etc.)
  • The framework, which is relevant when you use methods that operate on strings, or things like char.GetUnicodeCategory() - but which will let you load data from files even if it doesn't "understand" some characters.
  • Whatever applications do with the data - often data is just propagated from system to system in an opaque way, but often there are also other operations and checks performed on it.
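The split between the second and third points above can be sketched in a few lines. This is a minimal demo, not from the original answer: it assumes a console app, and uses U+1FAA8 (ROCK, added in Unicode 13.0) as the "new" character. The file round-trip works on any runtime, while the classification result depends on which Unicode tables the framework ships with.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

class UnicodeDataDemo
{
    static void Main()
    {
        // U+1FAA8 ROCK (added in Unicode 13.0); outside the BMP,
        // so it's stored as a surrogate pair (two chars).
        string rock = char.ConvertFromUtf32(0x1FAA8);

        // Reading and writing the raw text works regardless of the
        // framework's Unicode tables - the code points round-trip untouched.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, rock, Encoding.UTF8);
        string readBack = File.ReadAllText(path, Encoding.UTF8);
        Console.WriteLine(readBack == rock); // True

        // Classification, however, depends on the runtime's Unicode data:
        // a runtime with pre-13.0 tables reports OtherNotAssigned here,
        // while a newer one reports OtherSymbol.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(rock, 0));
    }
}
```

So "can I read a file that contains Unicode 13.0 characters" is a yes; what the framework can *tell you about* those characters is the runtime-dependent part.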

If you need to store some text in a database, and then display it on a user's screen, it's entirely possible for that text to go through various systems that don't understand some characters. That can be a problem, in terms of areas such as:

  • Equality and ordering: if two strings should be equal under a case-insensitive comparison, but the system doesn't know about some of the characters within those strings, it might get the wrong answer.
  • Validation: if a string is only meant to contain characters within certain Unicode categories, but the system doesn't know what category a character is in, it logically doesn't know for sure whether the string is valid.
  • Combining and normalization: again in terms of validation, if your system is meant to validate that a string is only (say) 5 characters long, but that's in a particular normalization form, then you need to be able to perform that normalization in order to get the right answer.

(There are no doubt lots of similar other areas.)
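The combining/normalization point is easy to demonstrate with an old, well-supported pair of characters (this is an illustrative sketch, not part of the original answer). The same mechanism is what breaks for characters newer than the runtime's tables: `Normalize` can only compose sequences its Unicode data knows about.

```csharp
using System;
using System.Text;

class NormalizationDemo
{
    static void Main()
    {
        // "é" as one precomposed code point vs. "e" + combining acute accent.
        string composed = "\u00E9";
        string decomposed = "e\u0301";

        // Ordinal comparison sees different code point sequences...
        Console.WriteLine(composed == decomposed);                                    // False

        // ...but after NFC normalization they are the same string.
        Console.WriteLine(composed == decomposed.Normalize(NormalizationForm.FormC)); // True

        // A naive length check is normalization-sensitive too:
        Console.WriteLine(decomposed.Length);                                         // 2
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC).Length);      // 1
    }
}
```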

The compiler is basically the least important part of this. The framework's level of Unicode support does matter, but whether being a bit out of date is actually a problem will depend on what's happening with the data.

Jon Skeet