
Windows C++ app. We have a string that contains only ASCII characters: std::wstring(L"abcdeABCDE ... any other ASCII symbol"). Note that this is a std::wstring, which uses wchar_t.

Question: does the byte representation of this string depend on the localization settings, or anything else? Can I assume that if I receive such a string (for example, from the Windows API) while the app is running, its bytes will be the same as on my PC?
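
To make the question concrete, this is roughly what I mean by "byte representation" (just an illustrative sketch; it assumes a Windows build where wchar_t is 2 bytes and wide strings are stored as UTF-16LE):

  #include <cstddef>
  #include <cstdio>
  #include <string>

  int main()
  {
      // On Windows wchar_t is 2 bytes, so L'A' is stored as the bytes 41 00.
      std::wstring s(L"abcdeABCDE");

      const unsigned char* bytes = reinterpret_cast<const unsigned char*>(s.data());
      for (std::size_t i = 0; i < s.size() * sizeof(wchar_t); ++i)
          std::printf("%02X ", bytes[i]);
      std::printf("\n");
      return 0;
  }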

Victor Mezrin
    You are using a Unicode string, encoded in utf-16 at runtime. It does not depend on locale, it is Unicode. Anything you get from the winapi will be Unicode as well with UNICODE #defined. String literals with non-ASCII characters do depend on your text editor saving the .cpp file in a Unicode encoding that the compiler can recognize, use utf-8 with a BOM so it doesn't turn into mojibake when your source code travels elsewhere. – Hans Passant Jun 01 '16 at 22:12
  • @HansPassant Thank you for the tip! But I need only a small range of ASCII symbols (a-zA-Z, space, dot). I don't have much experience with Windows/Unicode stuff, therefore I asked to be 100% sure. It seems that I'm right and the representation of these symbols does not depend on anything. – Victor Mezrin Jun 01 '16 at 22:35
  • It might help to forget about ASCII. In the Win32 API, you are using Unicode/UTF-16. Almost nobody would use the whole Unicode character set so almost every program will use a subset of Unicode. It doesn't matter if the subset you use is also a subset of a character set you aren't using. – Tom Blodget Jun 01 '16 at 23:36
  • @TomBlodget: the source code probably isn't UTF-16, so I think ASCII is still relevant here. – Harry Johnston Jun 02 '16 at 03:30
  • @HarryJohnston Whatever it is, it's less likely ASCII than UTF-16. The user saves the file in one specific encoding and the compiler has to know what that is. My VS2015 seems to default to Windows-1252 when creating and saving, and lets the compiler default by what is likely the same convention. Not very controlled, which leads to Hans's recommendation. – Tom Blodget Jun 02 '16 at 04:22
  • @TomBlodget: I think *most* ASCII characters are common to all Windows locales, aren't they? I know there are a few exceptions, but dammit, you shouldn't have to worry about what your source code encoding is. If you have a constant string with anything outside of the safe character set - which is certainly a subset of ASCII - the best option IMO is to use an escape sequence. – Harry Johnston Jun 02 '16 at 04:43
  • @TomBlodget: [this is interesting](https://msdn.microsoft.com/en-us/library/bt0y4awe.aspx): "The source character set of C source programs is contained within the 7-bit ASCII character set". I don't think it's actually true (?) but it's interesting. :-) – Harry Johnston Jun 02 '16 at 04:44

2 Answers


In general, for ordinary characters (not escape sequences), wchar_t and wstring use the same codes as ASCII (just widened to 2 bytes). I am less sure about codes below 32, and of course codes above 127 can have a different meaning than in ASCII at the moment of output, so to avoid problems on output, set a particular locale explicitly, e.g.:

  locale("en_US.UTF-8")

For standard output:

  wcout.imbue(locale("en_US.UTF-8")); 

UPDATE:

I found one more suggestion about adding

  std::ios_base::sync_with_stdio(false);

before setting the locale with imbue;

see the details in How can I use std::imbue to set the locale for std::wcout?
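
Putting the two pieces together, a minimal sketch might look like this (the locale name "en_US.UTF-8" is taken from above and may not be available on every system, so the sketch falls back to the user's default locale):

  #include <iostream>
  #include <locale>
  #include <stdexcept>

  int main()
  {
      // Detach the C++ streams from C stdio before touching locales.
      std::ios_base::sync_with_stdio(false);

      try {
          std::wcout.imbue(std::locale("en_US.UTF-8"));
      } catch (const std::runtime_error&) {
          // The locale name is platform-dependent; fall back to the
          // user's default locale if this one is not recognized.
          std::wcout.imbue(std::locale(""));
      }

      std::wcout << L"abcdeABCDE" << std::endl;
      return 0;
  }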

VolAnd
  • Thank you!!! I need only a narrow range of ASCII symbols - the characters a-zA-Z and several special symbols like space, dot, etc. Now I am more confident )) – Victor Mezrin Jun 01 '16 at 22:17

The byte representation of the literal string does not depend on the environment; it is hard-coded to the binary data that came out of the editor. However, the way that binary data is interpreted depends on the current code page, so you can end up with different results when a narrow string is converted to a wide string at runtime (as opposed to defining the string with a leading L, which means the wide characters are set at compile time).

To be safe, use setlocale() to guarantee the encoding used for conversion. Then you don't have to worry about the environment.
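
As an illustration of both points, here is a rough sketch (the conversion route via std::mbstowcs and the locale name "en_US.utf8" are assumptions for the example, not part of the original answer):

  #include <clocale>
  #include <cstdlib>
  #include <string>

  int main()
  {
      // A wide literal: its UTF-16 code units are fixed at compile time,
      // independent of whatever locale is active at runtime.
      std::wstring fixed = L"abcdeABCDE";

      // A narrow literal converted at runtime: the result depends on how
      // the current locale / code page interprets the bytes, so pin the
      // locale down explicitly before converting.
      std::setlocale(LC_ALL, "en_US.utf8");   // locale name varies by platform

      const char* narrow = "abcdeABCDE";
      wchar_t buffer[64] = {};
      std::mbstowcs(buffer, narrow, 63);

      // For plain ASCII input the two strings hold the same code units.
      return (fixed == std::wstring(buffer)) ? 0 : 1;
  }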

This might help: "By definition, the ASCII character set is a subset of all multibyte-character sets. In many multibyte character sets, each character in the range 0x00 – 0x7F is identical to the character that has the same value in the ASCII character set. For example, in both ASCII and MBCS character strings, the 1-byte NULL character ('\0') has value 0x00 and indicates the terminating null character."

From: Visual Studio Character Sets 'Not set' vs 'Multi byte character set'
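
A small check of that property (purely illustrative; it verifies that, for the printable ASCII characters the question cares about, the narrow and wide representations carry identical numeric values):

  #include <cassert>
  #include <cstddef>
  #include <string>

  int main()
  {
      // Same ASCII text as a narrow string and as a wide string: for the
      // 7-bit range the code unit values match one to one, which is the
      // property the quoted paragraph relies on.
      const std::string  narrow = "abcdeABCDE .";
      const std::wstring wide   = L"abcdeABCDE .";

      assert(narrow.size() == wide.size());
      for (std::size_t i = 0; i < narrow.size(); ++i)
          assert(static_cast<wchar_t>(narrow[i]) == wide[i]);
      return 0;
  }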

Jim Beveridge