
ADDENDUM: A tentative answer of my own appears at the bottom of the question.


I am converting an archaic VC6 C++/MFC project to VS2013 and Unicode, based on the recommendations at utf8everywhere.org.

Along the way, I have been studying Unicode, UTF-16, UCS-2, UTF-8, the standard library and STL support of Unicode & UTF-8 (or, rather, the standard library's lack of support), ICU, Boost.Locale, and of course the Windows SDK and MFC APIs that require UTF-16 wchar_t's.

As I have been studying the above issues, one question keeps recurring that I have not been able to answer to my satisfaction.

Consider the C library function mbstowcs. This function has the following signature:

size_t mbstowcs (wchar_t* dest, const char* src, size_t max);

The second parameter src is (according to the documentation) a

C-string with the multibyte characters to be interpreted. The multibyte sequence shall begin in the initial shift state.

My question is in regard to this multibyte string. It is my understanding that the encoding of a multibyte string can differ from string to string, and that the encoding is not specified by the standard. Nor does a particular encoding seem to be specified by the MSVC documentation for this function.

My understanding at this point is that on Windows, this multibyte string is expected to be encoded with the ANSI code page of the active locale. But my clarity begins to fade at this point.
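
To make this concrete, here is the sort of call I have in mind (just a sketch; the byte value 0xE9 and the empty-string setlocale call are illustrative - 0xE9 is 'é' in Windows-1252, and setlocale(LC_ALL, "") adopts the machine's default locale):

#include <locale.h>
#include <stdlib.h>

int main()
{
    // Adopt the user's default locale, so the CRT presumably interprets
    // multibyte strings using the system's ANSI code page.
    setlocale(LC_ALL, "");

    // These bytes spell "café" *if* they are interpreted as Windows-1252.
    const char src[] = "caf\xE9";

    wchar_t dest[1024];
    size_t n = mbstowcs(dest, src, 1024);   // interpretation depends on LC_CTYPE

    return (n == (size_t)-1) ? 1 : 0;       // (size_t)-1 signals an invalid sequence
}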

I have been wondering whether the encoding of the source code file itself makes a difference in the behavior of mbstowcs, at least on Windows. And, I'm also confused about what happens at compile time vs. what happens at run time for the code snippet above.

Suppose you have a string literal passed to mbstowcs, like this:

wchar_t dest[1024];
mbstowcs (dest, "Hello, world!", 1024);

Suppose this code is compiled on a Windows machine. Suppose that the code page of the source code file itself is different from the code page of the current locale on the machine on which the compiler runs. Will the compiler take the source code file's encoding into consideration? Will the resulting binary be affected by the fact that the code page of the source code file is different from the code page of the active locale under which the compiler runs?

On the other hand, maybe I have it wrong - maybe the active locale of the runtime machine determines the code page that is expected of the string literal. Therefore, does the code page with which the source code file is saved need to match the code page of the computer on which the program ultimately runs? That seems so whacked to me that I find it hard to believe this would be the case. But as you can see, my clarity is lacking here.

On the other hand, if we change the call to mbstowcs to explicitly pass a UTF-8 string:

wchar_t dest[1024];
mbstowcs (dest, u8"Hello, world!", 1024);

... I assume that mbstowcs will always do the right thing - regardless of the code page of the source file, the current locale of the compiler, or the current locale of the computer on which the code runs. Am I correct about this?

I would appreciate clarity on these matters, in particular in regards to the specific questions I have raised above. If any or all of my questions are ill-formed, I would appreciate knowing that, as well.


ADDENDUM: From the lengthy comments beneath @TheUndeadFish's answer, and from the answer to a question on a very similar topic here, I believe I have a tentative answer to my own question that I'd like to propose.

Let's follow the raw bytes of the source code file to see how the actual bytes are transformed through the entire process of compilation to runtime behavior:

  • The C++ standard 'ostensibly' requires that all characters in any source code file be drawn from a (particular) 96-character subset of ASCII called the basic source character set. (But see the following bullet points.)

    In terms of the actual byte-level encoding of these 96 characters in the source code file, the standard does not specify any particular encoding, but all 96 characters are ASCII characters, so in practice there is rarely a question about what encoding the source file is in, because virtually every encoding in common use represents these 96 ASCII characters using the same raw bytes.

  • However, character literals and code comments might commonly contain characters outside these basic 96.

    This is typically supported by the compiler (even though this isn't required by the C++ standard). The source code's character set is called the source character set. But the compiler needs to have these same characters available in its internal character set (called the execution character set), or else those missing characters will be replaced by some other (dummy) character (such as a square or a question mark) prior to the compiler actually processing the source code - see the discussion that follows.

    How the compiler determines the encoding that is used to encode the characters of the source code file (when characters appear that are outside the basic source character set) is implementation-defined.

    Note that it is possible for the compiler to use a different character set (encoded however it likes) for its internal execution character set than the character set represented by the encoding of the source code file!

    This means that even if the compiler knows about the encoding of the source code file (which implies that the compiler also knows about all the characters in the source code's character set), the compiler might still be forced to convert some characters in the source code's character set to different characters in the execution character set (thereby losing information). The standard states that this is acceptable, but that the compiler must not convert any characters in the source character set to the NULL character in the execution character set.

    Nothing is said by the C++ standard about the encoding used for the execution character set, just as nothing is said about the characters that are required to be supported in the execution character set (other than the characters in the basic execution character set, which include all characters in the basic source character set plus a handful of additional ones such as the NULL character and the backspace character).

    It does not seem to be documented anywhere very clearly, even by Microsoft, how any of this is handled in MSVC - i.e., how the compiler figures out what the encoding and corresponding character set of the source code file is, what the choice of execution character set is, and what encoding will be used for the execution character set during compilation of the source code file.

    It seems that in the case of MSVC, the compiler will make a best-guess effort in its attempt to select an encoding (and corresponding character set) for any given source code file, falling back on the current locale's default code page of the machine the compiler is running on. Or you can take special steps to save the source code files as Unicode using an editor that will provide the proper byte-order mark (BOM) at the beginning of each source code file. This includes UTF-8, for which the BOM is typically optional or excluded - in the case of source code files read by the MSVC compiler, you must include the UTF-8 BOM.

    And in terms of the execution character set and its encoding for MSVC, continue on with the next bullet point.

  • The compiler proceeds to read the source file and converts the raw bytes of the characters of the source code file from the encoding for the source character set into the (potentially different) encoding of the corresponding character in the execution character set (which will be the same character, if the given character is present in both character sets).

    Ignoring code comments and character literals, all such characters are typically in the basic execution character set noted above. This is a subset of the ASCII character set, so encoding issues are irrelevant (all of these characters are, in practice, encoded identically on all compilers).

    Regarding the code comments and character literals, though: the code comments are discarded, and if the character literals contain only characters in the basic source character set, then no problem - these characters will belong in the basic execution character set and still be ASCII.

    But if the character literals in the source code contain characters outside of the basic source character set, then these characters are, as noted above, converted to the execution character set (possibly with some loss). But as noted, neither the characters, nor the encoding for this character set is defined by the C++ standard. Again, the MSVC documentation seems to be very weak on what this encoding and character set will be. Perhaps it is the default ANSI encoding indicated by the active locale on the machine on which the compiler runs? Perhaps it is UTF-16?

  • In any case, the raw bytes that will be burned into the executable for the character string literal correspond exactly to the compiler's encoding of the characters in the execution character set.

  • At runtime, mbstowcs is called and it is passed the bytes from the previous bullet point, unchanged.

    It is now time for the C runtime library to interpret the bytes that are passed to mbstowcs.

    Because no locale is provided with the call to mbstowcs, the C runtime has no idea what encoding to use when it receives these bytes - this is arguably the weakest link in this chain.

    It is not documented by the C++ (or C) standard what encoding should be used to read the bytes passed to mbstowcs. I am not sure if the standard states that the input to mbstowcs is expected to be in the same execution character set as the characters in the execution character set of the compiler, OR if the encoding is expected to be the same for the compiler as for the C runtime implementation of mbstowcs.

    But my tentative guess is that in the MSVC C runtime, apparently the locale of the current running thread will be used to determine both the runtime execution character set, and the encoding representing this character set, that will be used to interpret the bytes passed to mbstowcs.

    This means that it will be very easy for these bytes to be mis-interpreted as different characters than were encoded in the source code file - very ugly, as far as I'm concerned.

    If I'm right about all this, then if you want to force the C runtime to use a particular encoding, you should call the Windows SDK's MultiByteToWideChar, as @HarryJohnston's comment indicates, because you can pass the desired encoding to that function explicitly (a sketch of this appears after this list).

  • Due to the above mess, there really isn't an automatic way to deal with character literals in source code files.

    Therefore, as https://stackoverflow.com/a/1866668/368896 mentions, if there's a chance you'll have non-ASCII characters in your character literals, you should use resources (such as GetText's method, which also works via Boost.Locale on Windows in conjunction with the xgettext.exe that ships with Poedit), and in your source code, simply write functions to load the resources as raw (unchanged) bytes.

    Make sure to save your resource files as UTF-8, and then call functions at runtime that explicitly support UTF-8 for their char*'s and std::string's - for example (from the recommendations at utf8everywhere.org), using Boost.Nowide (not really in Boost yet, I think) to convert from UTF-8 to wchar_t at the last possible moment prior to calling any Windows API functions that write text to dialog boxes, etc. (and using the W forms of those Windows API functions). For console output, you must call the SetConsoleOutputCP-type functions, as also described at https://stackoverflow.com/a/1866668/368896. (This last-moment conversion is sketched just below.)
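
To illustrate the workflow proposed in the last few bullet points, here is a rough sketch (the helper name widen_utf8 is mine and purely illustrative; in real code the UTF-8 text would be loaded from a resource rather than written as a literal):

#include <windows.h>
#include <string>

// Illustrative helper: convert UTF-8 to UTF-16 with an explicitly specified
// source encoding, rather than relying on the CRT's current locale.
std::wstring widen_utf8(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}

void show(HWND hwnd)
{
    // "\xC3\xA9" is the UTF-8 encoding of 'é', written as hex escapes so the
    // source file's own encoding cannot change the bytes.
    std::string caption = "caf\xC3\xA9";   // in practice, loaded from a resource

    // Convert at the last possible moment, and call the W form of the API.
    SetWindowTextW(hwnd, widen_utf8(caption).c_str());

    // For console output of UTF-8 bytes:
    SetConsoleOutputCP(CP_UTF8);
}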

Thanks to those who took the time to read the lengthy proposed answer here.

Dan Nissenbaum
  • The `mbstowcs` you are using is documented at [mbstowcs](http://msdn.microsoft.com/en-us/library/k1f9b8cy.aspx). The `src` string is interpreted using the calling thread's locale. To get reliable results you can either set the calling thread's locale, or use Microsoft's extension `_mbstowcs_l`, taking a locale parameter. – IInspectable Jan 09 '15 at 23:53
  • @IInspectable - will the bytes interpreted according to the calling thread's locale be identical to the bytes in the source code file representing the string literal, do you think? (This would mean, i.e., that if the locale of the *machine on which the code runs* corresponds to a code page that is *different* from the code page with which the *source code* is saved on the machine on which the code is *compiled*, the behavior might not be what the programmer intended. If I'm right, perhaps for this reason using the `u8` cast is a good idea - assuming that does the 'right' thing!) – Dan Nissenbaum Jan 10 '15 at 00:03
  • (1) [According to this](http://msdn.microsoft.com/en-us/library/69ze775t.aspx) VS2013 doesn't support u8"" string literals anyway? (2) The MultiByteToWideChar function is probably safer than mbstowcs, since it lets you explicitly specify the source encoding. (3) It is probably safest for your source code to contain only ASCII characters, because non-ASCII characters might be treated differently by the compiler you'll be using next year than by the one you're using now; UTF-8 or UTF-16 string literals can be built with \x syntax. – Harry Johnston Jan 10 '15 at 01:50
  • This StackOverflow answer provides the *best* explanation of what happens to the **raw bytes** of the source code file through the process of compilation and into the runtime system that I have yet found: http://stackoverflow.com/a/1866668/368896 – Dan Nissenbaum Jan 10 '15 at 02:39
  • A comment on your immediate goal (converting an MFC application to Unicode): Since the Windows API (and MFC) uses UTF-16 throughout, the best advice is to keep everything UTF-16 inside your application. UTF-8 should be used only when data leaves your application (e.g. serializing data to a file, or a network socket). If you keep your internal data UTF-8 encoded, you'll find yourself constantly converting from and to UTF-16. You cannot throw a UTF-8 encoded string at the ANSI version of a Windows API and hope for anything sane to happen. – IInspectable Jan 10 '15 at 13:33
  • @IInspectable Thanks. I've just posted a related question here: http://stackoverflow.com/questions/27880344/flow-of-raw-bytes-of-string-literal-into-out-of-the-windows-non-wide-execution – Dan Nissenbaum Jan 10 '15 at 19:32
  • "the Windows API documentation for this function [mbstowcs}" - Check the top of the page. This is **not** the Windows API. This is Visual Studio. Same company, but different product. – MSalters Jan 10 '15 at 20:11
  • @MSalters Thanks. I've corrected it to read `The MSVC documentation for this function`. – Dan Nissenbaum Jan 10 '15 at 20:13
  • @DanNissenbaum: Just to clarify, the comment wasn't so much textual, as aiming to point out that there are two products here (OS and compiler) which have not entirely identical views of characters. – MSalters Jan 10 '15 at 20:15
  • @MSalters Thanks. Thinking that over for a bit, it occurs to me that the relevant **OS** part of the storage and interpretation of non-ASCII characters in a binary file is the C runtime DLL's. – Dan Nissenbaum Jan 10 '15 at 20:21
  • @DanNissenbaum: Ehm, no? The C runtime DLL is _not_ part of the OS. – MSalters Jan 10 '15 at 20:32
  • @MSalters That might be considered debatable, as they are DLL's and I believe certain versions of the CRT DLL's are shipped with the OS. That is to say, it is my sense that the definition of what is "part of the OS" vs. what is "not part of the OS" is not formally defined. In terms of the *specifics* of the storage and interpretation of character strings in the (non-wide) `execution character set`, it's my understanding that the compiler, the CRT, and the OS choice of locale (perhaps via the user's language settings) are relevant factors. – Dan Nissenbaum Jan 10 '15 at 20:43
  • @DanNissenbaum: Visual Studio 6 and earlier (long out of support) use C runtime DLLs shipped with the OS. No current version of Visual Studio does. Even the shipped C runtime isn't part of the Win32 API, so there is a clear distinction. The C runtime of course *uses* the Win32 API. (It is also possible to write code that doesn't use the C runtime.) *Caveat:* there are functions documented in the Win32 API that are actually part of the C runtime; carelessness on Microsoft's part. So you're not entirely wrong. :-) – Harry Johnston Jan 10 '15 at 22:02
  • @HarryJohnston I didn't know that the redistributables of versions of VS more recent than VC6 are not included with any of the Windows OS's. (I did know that sometimes the redistributable for certain versions of VS needs to be installed, because I've built Windows installers for code written in VS, but that's as far as I've considered it up until now.) Thanks! – Dan Nissenbaum Jan 10 '15 at 22:06
  • @Dan: Windows 8 shipped prior to the release of Visual Studio 2013, for example. It cannot possibly contain the CRT that ships with Visual Studio 2013. The CRT is part of Visual Studio and owned by the Visual Studio team, not the OS folks. – IInspectable Jan 10 '15 at 23:32
  • @IInspectable That is clear. HarryJohnston says (if I understand him correctly) that no CRT more recent than VC6 has ever shipped with any version of Windows. My comment was in reference to any and all such redistributables, such as that for VS2008, VS2010, etc - in the context of whether the CRT might be considered "part of the OS". It does surprise me that the OS folks do not use any version of the CRT as distributed with any version of Visual Studio (assuming that they aren't using VC6's CRT, which I assume is a given), so I'm happy to learn this! – Dan Nissenbaum Jan 11 '15 at 00:51
  • @Dan: I was trying to explain that there would be no point in bundling the CRTs with the OS. The example I gave illustrates, that no application can ever rely on finding an appropriate CRT on a target system, and must always ship one anyway. Since the CRT is built on top of the OS, the OS cannot use it, obviously. OS tools built on top of the OS (like explorer.exe) apparently use their own CRT implementation with a file description of *Windows NT CRT DLL*. – IInspectable Jan 11 '15 at 01:30
  • @IInspectable Good to know. Thanks. (I had always imagined without ever looking into it that the OS components used some particular version of the MSVC CRT or earlier, and that applications only needed to install the redistributable for later versions. Now I understand better!) – Dan Nissenbaum Jan 11 '15 at 01:36

2 Answers


The encoding of the source code file doesn't affect the behavior of mbstowcs. After all, the internal implementation of the function is unaware of what source code might be calling it.

The MSDN documentation you linked says:

mbstowcs uses the current locale for any locale-dependent behavior; _mbstowcs_l is identical except that it uses the locale passed in instead. For more information, see Locale.

That linked page about locales then references setlocale which is how the behavior of mbstowcs can be affected.

Now, taking a look at your proposed way of passing UTF-8:

mbstowcs (dest, u8"Hello, world!", 1024);

Unfortunately, that isn't going to work properly, as far as I know, once you use interesting data. If it even compiles, it only does so because the compiler would have to be treating the u8 string the same as a plain char*. And as far as mbstowcs is concerned, it will assume the string is encoded according to whatever locale is currently set.

Even more unfortunately, I don't believe there's any way (on the Windows / Visual Studio platform) to set a locale such that UTF-8 would be used.

So that would happen to work for ASCII characters (the first 128 characters) only because they happen to have the exact same binary values in various ANSI encodings as well as UTF-8. If you try with any characters beyond that (for instance anything with an accent or umlaut) then you'll see problems.


Personally, I think mbstowcs and such are rather limited and clunky. I've found the Windows API function MultiByteToWideChar to be more effective in general. In particular, it can easily handle UTF-8 just by passing CP_UTF8 for the code page parameter.
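
For example (just a sketch; the bytes are the UTF-8 encoding of 'Ä', written as hex escapes so the source file's own encoding cannot change them):

#include <windows.h>
#include <locale.h>
#include <stdlib.h>

int main()
{
    const char utf8[] = "\xC3\x84";   // UTF-8 for 'Ä' (U+00C4)
    wchar_t dest[16];

    // CRT conversion: the bytes are interpreted per the current locale's ANSI
    // code page. Under a Windows-1252 locale this produces two unrelated
    // characters rather than a single L'Ä'.
    setlocale(LC_ALL, "");
    mbstowcs(dest, utf8, 16);

    // Explicit conversion: the source encoding (CP_UTF8) is stated directly,
    // so this produces the single character L'Ä'.
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, dest, 16);

    return 0;
}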

TheUndeadFish
  • Thanks. Quick question - by `the current locale` - does this mean the current locale of the **compiler**, or the current locale of the system on which the program **runs**? – Dan Nissenbaum Jan 10 '15 at 00:11
  • Actually, the current locale of the C Runtime in the program. Since it's inside the program's memory, that's how `setlocale` is able to change it. – TheUndeadFish Jan 10 '15 at 00:13
  • The CRT that ships with Visual Studio manages the locale per thread, not per process. *The currently active locale* refers to the thread's locale, on which the code runs. – IInspectable Jan 10 '15 at 00:16
  • @DietmarKühl and others - am I correct, then, that if the code page of the *source code file* differs from the code page that the CRT uses when the program **runs**, the results are likely not to be what the programmer expects (**even** if the code is cast to a UTF-8 string in the source code)? – Dan Nissenbaum Jan 10 '15 at 00:21
  • Ah yes, I was being sloppy about the thread aspect. Although it seems there is a `_configthreadlocale` function which can set whether `setlocale` affects only the current thread or all threads. – TheUndeadFish Jan 10 '15 at 00:21
  • @DanNissenbaum Correct Dan, the source code encoding has no relevance and mbstowcs doesn't handle UTF-8 anyway. Try testing with something like `u8"Ä"` or `u8"Φ"`. – TheUndeadFish Jan 10 '15 at 00:28
  • TheUndeadFish - Thanks. In my thinking, I'm saying the same thing, I think, by saying that the source code encoding **does** have relevance - in that the source code encoding determines the sequence of bytes that will be stored in the program, and it is this exact sequence of bytes that will be interpreted by the CRT at *runtime* according to the active *runtime* locale's code page. I'd like to confirm that I'm correct. (And that the wording - whether the source code encoding *does* vs. *does not* have relevance - is just a matter of how you want to think about it, but is irrelevant.) – Dan Nissenbaum Jan 10 '15 at 00:33
  • @DietmarKühl - by `target encoding`, are you referring to the encoding of the active locale used by the **compiler** at compile time? – Dan Nissenbaum Jan 10 '15 at 00:37
  • Well, however you want to say it, mbstowcs doesn't know about the encoding of the source file. It only sees the series of chars passed to it. So if the input is in a different encoding than it expects (based on runtime factors, as mentioned), then things can be incorrectly converted. – TheUndeadFish Jan 10 '15 at 00:39
  • UndeadFish - Thanks. I guess I still want to be certain I understand what `the series of chars passed to it` will be. Will these be the **exact same** sequence of chars as stored in the raw bytes of the source code file, or will they be converted by the **compiler**? – Dan Nissenbaum Jan 10 '15 at 00:40
  • For bare strings like `"Hello"` I believe the Visual Studio compiler stores them according to the ANSI code page corresponding to the current regional/language settings of the system on which the compiler is running. For `u8` they should always be stored as UTF-8 regardless. And similarly wide strings (like `L"Hello"`) will always be UTF-16 under Visual Studio. – TheUndeadFish Jan 10 '15 at 00:44
  • TheUndeadFish and @DietmarKühl - I'm having a tricky time reconciling what DietmarKühl is saying about the `execution character set` (which I take to be determined at **runtime**), with what TheUndeadFish is saying about the `compiler storing them according to the ANSI code page`... it seems DietmarKühl is referring to the *runtime* system determining how the bytes are interpreted, whereas TheUndeadFish is referring to the *compiler* making (at least part of) the determination. Thank you both for taking the time to discuss. – Dan Nissenbaum Jan 10 '15 at 00:48
  • If I'm understanding it correctly, the "execution character set" is something like the encoding of character data inside the executable itself. And that means it's the compiler which determines that, since it's the one that builds the executable. So it's not something that would change during runtime. – TheUndeadFish Jan 10 '15 at 00:50
  • TheUndeadFish - I would expect that the bytes of the string literal, as burned into the executable, would not change during runtime; however, it is my understanding that the *interpretation* of these bytes *is* determined at runtime by the active locale of the thread (at runtime). At the same time, I'd like to know if the *compiler* may convert the *raw bytes* from the source code file into *different* bytes that are burned into the executable. – Dan Nissenbaum Jan 10 '15 at 00:53
  • As far as I know, yes indeed the compiler can convert from the source code file to what it stores into the executable. After all you could have source in UTF-8 or UTF-16 but a bare `"Hello"` string should still end up as an ANSI encoding as far as the executable code is concerned. In fact, if you have a string literal with a character that can't be represented in that ANSI encoding, I think the Visual Studio compiler may even issue a warning. – TheUndeadFish Jan 10 '15 at 00:58
  • TheUndeadFish - If the compiler can convert the raw bytes in the *source code* file to a different set of bytes burned into the executable representing the string literal, do you know what determines the rules by which the VS compiler will make such a conversion? – Dan Nissenbaum Jan 10 '15 at 01:02
  • @DietmarKühl - Thanks. I am currently studying up on execution character sets. Regarding this particular StackOverflow question - it would be nice if somebody were able to take the time to actually spell out what happens to the raw bytes of the string literal, starting with the raw bytes of the source code file, working through the compiler reading those raw bytes and converting to the source and then the execution character set, through to the bytes burned into the executable, and finally what happens to those bytes at runtime with `mbstowcs`. I know that's a lot to ask! – Dan Nissenbaum Jan 10 '15 at 01:39
  • @DietmarKühl and TheUndeadFish - The following StackOverflow answer provides the *best* explanation of what happens to the **raw bytes** of the source code file through the process of compilation and into the runtime system that I have yet found: http://stackoverflow.com/a/1866668/368896 – Dan Nissenbaum Jan 10 '15 at 02:40
  • @DietmarKühl and TheUndeadFish - I have added an addendum to my answer which attempts to follow the raw bytes of a string literal through the entire process, from the source code file to the CRT. If you care to and have time, please have a look and tell me if you think I am on the right track. – Dan Nissenbaum Jan 10 '15 at 03:59
  • @DietmarKühl I am now reading the question & answer you've just created & posted - Thanks! – Dan Nissenbaum Jan 10 '15 at 04:19

mbstowcs() semantics are defined in terms of the currently installed C locale. If you are processing strings with different encodings, you will need to use setlocale() to change which encoding is currently being used. The relevant statement in the C standard is in 7.22.8 paragraph 1:

The behavior of the multibyte string functions is affected by the LC_CTYPE category of the current locale.

I don't know enough about the C library but as far as I know none of these functions is really thread-safe. I consider it much easier to deal with different encodings and, in general, cultural conventions, using the C++ std::locale facilities. With respect to encoding conversions you'd look at the std::codecvt<...> facets. Admittedly, these aren't easy to use, though.
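
One relatively compact way to use them is the C++11 std::wstring_convert wrapper (a sketch, assuming the C++11 <codecvt> header available in VS2013; these components were later deprecated in C++17):

#include <codecvt>
#include <locale>
#include <string>

std::wstring utf8_to_wide(const std::string& utf8)
{
    // codecvt_utf8_utf16 converts between UTF-8 byte sequences and UTF-16
    // code units, which is what wchar_t holds on Windows.
    // from_bytes() throws std::range_error on an invalid UTF-8 sequence.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(utf8);
}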

The current locale needs a bit of clarification: the program has a current global locale. Initially, this locale is somehow set up by the system and is possibly controlled by the user's environment in some form. For example, on UNIX systems there are environment variables which choose the initial locale. Once the program is running, it can change the current locale, however. How that is done depends a bit on what exactly is being used: a running C++ program actually has two locales: one used by the C library and one used by the C++ library.

The C locale is used for all locale-dependent functions from the C library, e.g., mbstowcs(), but also tolower() and printf(). The C++ locale is used for all locale-dependent functions which are specific to the C++ library. Since C++ uses locale objects, the global locale is just used as the default for entities not given a locale specifically, primarily streams (you'd set a stream's locale using s.imbue(loc)). Depending on which locale you set, there are different methods to set the global locale:

  1. For the C locale you use setlocale().
  2. For the C++ locale you use std::locale::global().
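
For concreteness, a minimal sketch of the two mechanisms (the empty locale name "" is illustrative and means "the user's default environment locale"):

#include <locale.h>
#include <locale>
#include <sstream>

void set_up_locales()
{
    // 1. The C locale: affects mbstowcs(), tolower(), printf(), ...
    setlocale(LC_ALL, "");

    // 2. The C++ global locale: the default used by newly constructed streams
    //    and other C++ facilities that are not given a locale explicitly.
    std::locale::global(std::locale(""));

    // An individual stream can still be given its own locale:
    std::ostringstream s;
    s.imbue(std::locale::classic());   // force the "C" locale for this stream only
}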
Dietmar Kühl