
In many places we can read that, for example, "C# uses UTF-16 for its strings" (link). Technically, what does this mean? My source file is just some text. Say I'm using Notepad++ to code a simple C# app; how the text is represented in bytes on disk, after I save the file, depends on N++, so that's probably not what people mean. Does that mean that:

  • The language specification requires/recommends that the compiler input be encoded as UTF-16?
  • The standard library functions are encoding-aware and treat the strings as UTF-16, for example String's operator [] (which returns the n-th character and not the n-th byte; see the sketch below)?
  • Once the compiler produces an executable, the strings stored inside it are in UTF-16?

I've used C# as an example, but this question applies to any language of which one could say that it uses encoding Y for its strings.
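
For example, what I mean by the second bullet, as a rough sketch of the behaviour I'm asking about (my own illustration):

```csharp
using System.Text;

string s = "héllo";                                // 'é' is a single character, U+00E9

char second = s[1];                                // 'é': the 2nd *character*, not the 2nd byte
int utf8Bytes = Encoding.UTF8.GetBytes(s).Length;  // 6, because 'é' takes 2 bytes in UTF-8

System.Console.WriteLine($"{second} / {utf8Bytes} UTF-8 bytes");
```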

user4520

2 Answers


"C# uses UTF-16 for its strings"

As far as I understand this concept, this is a simplification at best. A CLI runtime (such as the CLR) is required to store strings, whether loaded from assemblies or generated at runtime, as UTF-16 in memory, or at least to present them as such to the rest of the runtime and the application.

See CLI specification:

III.1.1.3 Character data type

A CLI char type occupies 2 bytes in memory and represents a Unicode code unit using UTF-16 encoding. For the purpose of stack operations char values are treated as unsigned 2-byte integers (§III.1.1.1)

And C# specification:

4.2.4 The string type

Instances of the string class represent Unicode [being UTF-16 in .NET jargon] character strings.
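
To make "Unicode code unit" concrete, here is a small sketch of my own (not taken from either spec) showing that char and the string indexer work in UTF-16 code units, not in user-perceived characters:

```csharp
using System;

class CodeUnits
{
    static void Main()
    {
        string s = "a😀";   // 'a' plus one emoji outside the Basic Multilingual Plane

        Console.WriteLine(s.Length);                    // 3: 'a' + a surrogate *pair*
        Console.WriteLine((int)s[1]);                   // 55357 (0xD83D), a high surrogate
        Console.WriteLine(char.IsSurrogatePair(s, 1));  // True
    }
}
```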

I can't quickly find which file encodings the C# compiler supports, but I'm quite sure you can have a source file stored in UTF-8 encoding, or even ASCII (or another non-Unicode code page).

The standard library functions are encoding-aware and treat the strings as UTF-16

No, the BCL just treats strings as strings: a wrapper around a char[] array. Only when a string crosses the runtime boundary, like in a P/Invoke call, does the runtime "know" which platform functions to invoke and how to marshal a string to those functions. See for example C++/CLI Converting from System::String^ to std::string
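
As a quick sketch of my own (not from the linked question) of the two typical boundary crossings: explicitly asking an Encoding for bytes, and letting the marshaller handle a native call:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

class Boundaries
{
    // The marshaller converts the UTF-16 System.String to whatever the native side
    // expects; CharSet.Unicode selects the wide (UTF-16) Win32 variant.
    [DllImport("user32.dll", CharSet = CharSet.Unicode)]
    static extern int MessageBox(IntPtr hWnd, string text, string caption, uint type);

    static void Main()
    {
        string s = "naïve";

        // Inside the runtime, s is just a string; an encoding only gets chosen
        // when we ask for a byte representation.
        byte[] utf8  = Encoding.UTF8.GetBytes(s);     // 6 bytes
        byte[] utf16 = Encoding.Unicode.GetBytes(s);  // 10 bytes (UTF-16LE)

        Console.WriteLine($"{utf8.Length} UTF-8 bytes vs {utf16.Length} UTF-16 bytes");
        // MessageBox(IntPtr.Zero, s, "Demo", 0);     // Windows only
    }
}
```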

Once the compiler produces an [assembly], the strings are stored inside it in UTF-16?

Yes.
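
If you want to verify this yourself, here is a rough sketch of my own (assuming a normal, non-single-file build so that Assembly.Location points at the file on disk): string literals live in the assembly's #US (user string) heap encoded as UTF-16, so their UTF-16LE bytes can be found verbatim in the compiled file:

```csharp
using System;
using System.IO;
using System.Text;

class FindLiteral
{
    static void Main()
    {
        const string literal = "a reasonably unique literal";

        byte[] assembly = File.ReadAllBytes(typeof(FindLiteral).Assembly.Location);
        byte[] needle = Encoding.Unicode.GetBytes(literal);   // UTF-16LE bytes

        // Naive byte search; good enough for a demonstration.
        bool found = false;
        for (int i = 0; i <= assembly.Length - needle.Length && !found; i++)
        {
            int j = 0;
            while (j < needle.Length && assembly[i + j] == needle[j]) j++;
            found = j == needle.Length;
        }

        Console.WriteLine(found ? "found as UTF-16 bytes" : "not found");
    }
}
```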

CodeCaster
  • `the BCL just treats strings as strings` - okay, but, for example, to implement the mentioned `operator[]` (not sure if I'm using the technically correct name here), we have to know the string's encoding; thus at least some code in the class needs to know how to interpret the class's contents, am I right? – user4520 Oct 14 '15 at 11:34
  • @szczurcio no, a string at runtime is guaranteed to be encoded in UTF-16, so neither the runtime nor the language has to be "aware" of the encoding. :) This knowledge is built into the `char` and `string` types, in the runtime itself, and exposed through the language. The encoding only comes into play when a string is to be exported outside of the "string scope" (being the .NET type), like into a byte array representing the character data or when marshalling to a platform function. – CodeCaster Oct 14 '15 at 11:38
  • Ah, that makes sense. Would you say that statements claiming a language uses one particular encoding for its strings are an oversimplification, or does such a statement make sense for languages other than C#? Surely not the liberal C and C++, but maybe something else? – user4520 Oct 14 '15 at 11:41
  • @szczurcio I'm not _that_ well-versed in the specification jargon, so I wouldn't really consider myself an authoritative source for that. I would say that the C# language merely _exposes_, or _interfaces with_, the core principles and rules of the CLI (while adding syntactic sugar to certain concepts in order to make them more easily accessible for the programmer). So in a sense you could say _"C# uses UTF-16 for strings"_, but that's not C#'s doing, that's the CLI's. – CodeCaster Oct 14 '15 at 11:44

Let's take a look at the C/C++ char type. It is 8 bits (1 byte) long, which means it can store 256 different values. Now let's think about what a single-byte code page actually is. It is something like a map: values from 0 to 255 are mapped to symbols. Such a code page usually contains two alphabets (Cyrillic and Latin, for example) plus some special symbols. There is not enough room within that 256-value limit to also store Greek or Chinese letters.

Now let's see what UTF-8 is. It is an encoding that stores some symbols using 1 byte and others using 2, 3 or 4 bytes. For example, if you type the word "word" in Notepad and save the file with UTF-8 encoding, the resulting file will be exactly 4 bytes long; but if you type the word "дума", which is again 4 symbols, it will take 8 bytes on your storage, because each Cyrillic letter is stored as 2 bytes.

UTF-16 stores every symbol in 2 bytes (or in 4 bytes, as a surrogate pair, for symbols outside the Basic Multilingual Plane), and UTF-32 stores every symbol in 4 bytes.
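
A quick way to see these sizes from C# (my own sketch, not part of the original answer):

```csharp
using System;
using System.Text;

class ByteCounts
{
    static void Main()
    {
        foreach (string s in new[] { "word", "дума", "😀" })
        {
            Console.WriteLine($"{s}: UTF-8 = {Encoding.UTF8.GetByteCount(s)}, " +
                              $"UTF-16 = {Encoding.Unicode.GetByteCount(s)}, " +
                              $"UTF-32 = {Encoding.UTF32.GetByteCount(s)}");
        }
        // word: UTF-8 = 4, UTF-16 = 8, UTF-32 = 16
        // дума: UTF-8 = 8, UTF-16 = 8, UTF-32 = 16
        // 😀:   UTF-8 = 4, UTF-16 = 4, UTF-32 = 4
    }
}
```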

Let's see how this looks from a programming point of view. When you type symbols in Notepad, they are stored in RAM (in some format that Notepad understands). When you save the file to disk, Notepad writes a sequence of bytes, and that sequence depends on the chosen encoding. When you later read the file (with C# or some other language), you have to know its encoding; only then do you know how to interpret the sequence of bytes written on the disk.
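
For example, in C# you either state the encoding explicitly when reading the file back, or let the reader detect a byte order mark; the file name below is just a placeholder:

```csharp
using System;
using System.IO;
using System.Text;

class ReadWithEncoding
{
    static void Main()
    {
        string path = "some-file.txt";            // hypothetical file

        // The bytes on disk are just bytes...
        byte[] raw = File.ReadAllBytes(path);

        // ...they only become text once we pick an encoding to interpret them with.
        string asUtf8  = Encoding.UTF8.GetString(raw);
        string asUtf16 = Encoding.Unicode.GetString(raw);

        // Or let ReadAllText detect a BOM and otherwise assume UTF-8.
        string detected = File.ReadAllText(path);

        Console.WriteLine($"{asUtf8.Length} / {asUtf16.Length} / {detected.Length} chars");
    }
}
```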

Kamen Stoykov
  • I think OP is not asking for a summary of [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html), but rather what it means for a language or runtime to be Unicode-aware. – CodeCaster Oct 14 '15 at 11:32
  • Indeed, this answer is somewhat off-topic. – user4520 Oct 14 '15 at 11:33