
Here is what I know about VC++'s /source-charset and /execution-charset options (https://learn.microsoft.com/en-us/cpp/build/reference/source-charset-set-source-character-set).

So there are 3 things I need to keep consistent (please correct me if anything is wrong):

  1. the source file encoding
  2. the /source-charset setting (determines how the compiler interprets my source file)
  3. the /execution-charset setting (determines how the compiler encodes the output of stage 2 into the executable)

So, if I save the source file in encodingA, set both /source-charset and /execution-charset to encodingA, and have the code wchar_t c = L'é'; or char16_t c = u'é'; or char32_t c = U'é';

will the code unit of é change depending on which encodingA I choose for that interpretation?

Or will é's code unit never change, no matter which encoding I choose?

(Don't worry about the console output.)

jwfearn
Rick
  • [Welcome to hell.](https://stackoverflow.com/questions/17103925/how-well-is-unicode-supported-in-c11) – user4581301 May 31 '18 at 19:52
  • @user4581301 My question is as difficult as that one...? – Rick May 31 '18 at 19:58
  • I'm not sure. I misread the question. In raw C++ things are sketchy, but clearly you are after a Visual Studio specific answer. My MSVC is weak, so I'm going to shut up and walk away. – user4581301 May 31 '18 at 20:06
  • Maybe a bit off-topic, but if you want to store non-ASCII characters in your source files then I strongly recommend storing them as UTF-8. UTF-16 (which is what Visual Studio tends to use unless you tell it otherwise, for example) will cause all kinds of grief with third party tools that don't understand it. Your files will also be smaller. UTF-16 as a data interchange format is really only used in Redmond. – Paul Sanders Jun 01 '18 at 05:12
  • @PaulSanders Yes, I know that. But I strongly believe VS uses the code page tied to the system region, not UTF-16. I don't think any mainstream platform uses UTF-16 as its default text file encoding. – Rick Jun 01 '18 at 05:40
  • Saving files as UTF-8 is a supported encoding in Visual Studio, see [here](https://msdn.microsoft.com/en-us/library/dxfdkfke.aspx?f=255&MSPPError=-2147217396). I _think_ it puts a BOM in the file so it will recognise it as UTF-8 when it next opens it but you'd have to test that. Personally, I play it safe and stick to ASCII in my source files but VS once saved a .RC file as UTF-16 without me knowing and it played merry hell with my source code control system. Ugh. – Paul Sanders Jun 01 '18 at 06:07

2 Answers


/source-charset tells the compiler how the text in your source file is encoded as bytes on disk, nothing more. The code editor knows é is Unicode codepoint U+00E9 and writes it to the file in whatever encoding the file is saved in (0xE9 in Latin-1, 0xC3 0xA9 in UTF-8, etc).

When the compiler then reads the source file, it converts the file's bytes to Unicode using the specified /source-charset, and then processes Unicode data as needed. At this stage, provided the correct /source-charset is used so the file's bytes are decoded properly, the é is read back in as Unicode codepoint U+00E9, and is not handled in any particular encoding until the next step.
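To make the decoding step concrete, here is a minimal sketch of what "convert the file's bytes to Unicode" means for é. The helpers `decode_latin1` and `decode_utf8_basic` are hypothetical names written for this illustration, not anything MSVC exposes, and the UTF-8 decoder only handles 1- and 2-byte sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode ISO-8859-1 (Latin-1) bytes.
// Each Latin-1 byte maps to the code point with the same numeric value.
inline std::u32string decode_latin1(const std::vector<std::uint8_t>& bytes) {
    std::u32string out;
    for (std::uint8_t b : bytes) {
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}

// Hypothetical helper: decode UTF-8, limited to 1- and 2-byte sequences
// (U+0000..U+07FF). Anything else is replaced with U+FFFD.
inline std::u32string decode_utf8_basic(const std::vector<std::uint8_t>& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const std::uint8_t b = bytes[i];
        if (b < 0x80) {
            out.push_back(static_cast<char32_t>(b));  // ASCII
        } else if (b >= 0xC2 && b <= 0xDF && i + 1 < bytes.size() &&
                   (bytes[i + 1] & 0xC0) == 0x80) {
            // 110xxxxx 10yyyyyy -> xxxxxyyyyyy
            out.push_back(static_cast<char32_t>(((b & 0x1F) << 6) |
                                                (bytes[i + 1] & 0x3F)));
            ++i;  // consume the continuation byte
        } else {
            out.push_back(static_cast<char32_t>(0xFFFD));  // not handled here
        }
    }
    return out;
}

// Usage sketch: different bytes on disk, same code point after decoding.
inline void check_decoding() {
    assert(decode_latin1({0xE9}) == U"\u00E9");
    assert(decode_utf8_basic({0xC3, 0xA9}) == U"\u00E9");
}
```

Whichever encoding the file was saved in, a matching decoder yields the single code point U+00E9, which is why the later stages do not depend on the on-disk encoding.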

The /execution-charset dictates what encoding Unicode data is saved as in the executable if no other encoding is specified in the code. It does not apply in your examples because the L/u/U prefixes dictate the encoding (L = UTF-16 or UTF-32, depending on platform, u = UTF-16, U = UTF-32). So:

wchar_t wc = L'é'; // 0xE9 0x00 or 0xE9 0x00 0x00 0x00

char16_t c16 = u'é'; // 0xE9 0x00

char32_t c32 = U'é'; // 0xE9 0x00 0x00 0x00
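As a quick way to check this yourself, the sketch below spells é as the universal character name `\u00E9`, so the result does not depend on how the test file is saved. The variable names are only illustrative, and the `wchar_t` check assumes a platform whose wide encoding is UTF-16 or UTF-32 (true for MSVC and the common Linux and macOS toolchains). The byte sequences in the comments above are these same values laid out in little-endian memory.

```cpp
#include <cassert>

// \u00E9 is the universal character name for é; using it keeps this
// example independent of the source file's own encoding.
constexpr char16_t c16 = u'\u00E9';  // UTF-16 code unit
constexpr char32_t c32 = U'\u00E9';  // UTF-32 code unit
constexpr wchar_t  wc  = L'\u00E9';  // UTF-16 on MSVC; assumed UTF-32 elsewhere

// Compile-time checks: the numeric value is the code point itself.
static_assert(c16 == 0x00E9, "char16_t holds U+00E9");
static_assert(c32 == 0x00E9, "char32_t holds U+00E9");
static_assert(wc == 0x00E9, "wchar_t holds U+00E9 (assumes a Unicode wide encoding)");

// Runtime view of the same fact, e.g. for stepping through in a debugger.
inline void check_wide_literals() {
    assert(static_cast<unsigned long>(c16) == 0xE9UL);
    assert(static_cast<unsigned long>(c32) == 0xE9UL);
    assert(static_cast<unsigned long>(wc) == 0xE9UL);
}
```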

Were you using char instead, then /execution-charset would apply:

char c = 'é';  // MAYBE 0xE9 or other single-byte value, or a multi-byte overflow warning/error

const char *s = "é";  // MAYBE 0xE9 or other single-byte value, or maybe 0xC3 0xA9

Unless you use the u8 prefix for UTF-8:

char c8 = u8'é'; // illegal! é needs two UTF-8 code units, so it cannot fit in a single u8 character literal

const char *s8 = u8"é";  // 0xC3 0xA9
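A small sketch to contrast the two string cases, again using `\u00E9` so the file's own encoding does not matter. The names `u8_len`, `narrow_len` and `u8_unit` are made up for this example. It is written to compile whether `u8` literals are arrays of `char` (before C++20) or `char8_t` (C++20 and later).

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in the u8 literal, excluding the null terminator.
// UTF-8 always encodes U+00E9 as two code units.
constexpr std::size_t u8_len = sizeof(u8"\u00E9") - 1;

// Number of chars in the plain narrow literal. This one depends on the
// execution charset: typically 1 for a single-byte code page, 2 for UTF-8.
constexpr std::size_t narrow_len = sizeof("\u00E9") - 1;

// Return the i-th code unit of u8"é" as an unsigned byte value.
inline unsigned char u8_unit(std::size_t i) {
    const auto& s = u8"\u00E9";  // array of const char or const char8_t
    return static_cast<unsigned char>(s[i]);
}

inline void check_u8_literal() {
    assert(u8_len == 2);
    assert(u8_unit(0) == 0xC3);
    assert(u8_unit(1) == 0xA9);
    assert(narrow_len >= 1);  // exact value varies with the execution charset
}
```

Building this with different /execution-charset values should change only `narrow_len`, not the `u8` results.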
Remy Lebeau
  • Very helpful! In a way, this confirms my deduction. One more thing I would like clarified: in the executable, `L'é'` (or the `u` or `U` form) will still be `0xE9 0x00` or `0xE9 0x00 0x00 0x00`, and these internal code units will **never** be affected no matter which code page (charset) is used; they are "hard-coded" **from beginning to end**. Is that right? – Rick Jun 01 '18 at 06:48
  • Btw, are `/source-charset` and `/execution-charset` settings available for all compilers or VC++ compiler only? – Rick Jun 01 '18 at 06:51
  • `L` stands for "whatever encoding the compiler chooses for wchar_t", and formally that could even be ASCII or ISO-8859-x. In the context of this question, MSVC will always choose UTF-16. – MSalters Jun 01 '18 at 11:50
  • @MSalters `wchar_t` is primarily either 2 bytes (Windows) or 4 bytes (most other platforms). No compiler that uses those sizes would pick those charsets for it – Remy Lebeau Jun 01 '18 at 14:33

When you write wchar_t c = L'é'; in the source file, the text has to be stored as raw bytes somehow, and the encoding you use when saving the source file determines which bytes represent é in the file.

Obviously the encoding you used to store the source file should match the compiler's source charset setting. The compiler literally reads your source file and interprets its contents based on the configured encoding.

For example, if you saved 'é' in UTF-8 and read it back as ISO-8859-1, you'd see 'Ã©' (the two UTF-8 bytes 0xC3 0xA9, each taken as a separate character).

But if you saved 'é' in ISO-8859-1 and read back in UTF-8, you'd get either a bad encoding error or a fallback to some other encoding.
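Both mismatch cases can be sketched in a few lines. The helpers below are hypothetical and only illustrative: `read_as_latin1` treats each byte as one character, and `is_structurally_valid_utf8` checks lead and continuation bytes only (it does not reject overlong forms or surrogates, which a real decoder would).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: interpret bytes as ISO-8859-1, one code point per byte.
inline std::u32string read_as_latin1(const std::vector<std::uint8_t>& bytes) {
    std::u32string out;
    for (std::uint8_t b : bytes) {
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}

// Hypothetical helper: structural UTF-8 check (lead byte + continuation bytes).
inline bool is_structurally_valid_utf8(const std::vector<std::uint8_t>& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const std::uint8_t b = bytes[i];
        std::size_t extra = 0;  // continuation bytes expected after the lead
        if (b < 0x80) extra = 0;
        else if ((b & 0xE0) == 0xC0) extra = 1;
        else if ((b & 0xF0) == 0xE0) extra = 2;
        else if ((b & 0xF8) == 0xF0) extra = 3;
        else return false;  // stray continuation byte or invalid lead byte
        if (i + extra >= bytes.size()) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            if ((bytes[i + k] & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}

inline void check_mismatches() {
    // é saved as UTF-8, read as Latin-1: two characters, U+00C3 and U+00A9.
    assert(read_as_latin1({0xC3, 0xA9}) == U"\u00C3\u00A9");
    // é saved as Latin-1, read as UTF-8: 0xE9 is a lead byte with nothing after it.
    assert(!is_structurally_valid_utf8({0xE9}));
}
```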

It depends on what non-ASCII characters you use in your source files. If only Latin-1, then it's best to store the source in Windows-1252 (or whatever the default encoding is for your locale) because MSVC defaults the source charset to that when no BOM is present. Then you won't need to specify any /source-charset.

If you use more than Latin characters, or you want maximum portability, the best option is to use UTF-8 and pass the /utf-8 flag to cl.exe, which is shorthand for /source-charset:utf-8 /execution-charset:utf-8.

rustyx