
I'm working with a legacy application and I'm trying to work out the difference between applications compiled with Multi byte character set and Not Set under the Character Set option.

I understand that compiling with Multi byte character set defines _MBCS which allows multi byte character set code pages to be used, and using Not set doesn't define _MBCS, in which case only single byte character set code pages are allowed.

In the case that Not Set is used, I'm assuming then that we can only use the single byte character set code pages found on this page: http://msdn.microsoft.com/en-gb/goglobal/bb964654.aspx

Therefore, am I correct in thinking that if Not Set is used, the application won't be able to encode and write or read far eastern languages, since they are defined in double byte character set code pages (and of course Unicode)?

Following on from this, if Multi byte character set is defined, are both single and multi byte character set code pages available, or only multi byte character set code pages? I'm guessing it must be both for European languages to be supported.

Thanks,

Andy

Further Reading

The answers on these pages didn't answer my question, but helped in my understanding: About the "Character set" option in visual studio 2010

Research

So, just as working research... With my locale set as Japanese

Effect on hard coded strings

char *foo = "Jap text: テスト";
wchar_t *bar = L"Jap text: テスト";

Compiling with Unicode

*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-Jis (Code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2

Compiling with Multi byte character set

*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-Jis (Code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2

Compiling with Not Set

*foo = 4a 61 70 20 74 65 78 74 3a 20 83 65 83 58 83 67 == Shift-Jis (Code page 932)
*bar = 4a 00 61 00 70 00 20 00 74 00 65 00 78 00 74 00 3a 00 20 00 c6 30 b9 30 c8 30 == UTF-16 or UCS-2

Conclusion: The Character Set option doesn't have any effect on hard coded strings. Defining chars as above seems to use the locale-defined code page, and wchar_t seems to use either UCS-2 or UTF-16.
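The byte listings above can be reproduced with a small dump helper. This is a minimal sketch, not the code used for the test: the function names `dump_bytes` and `dump_utf16le` are hypothetical, and `char16_t` with `\u` escapes is used instead of `wchar_t` so the result does not depend on the source file encoding or on the platform's `wchar_t` size.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats each byte of a narrow string as two hex digits, space separated.
std::string dump_bytes(const std::string& s) {
    std::string out;
    char tmp[4];
    for (unsigned char c : s) {
        std::snprintf(tmp, sizeof tmp, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += tmp;
    }
    return out;
}

// Formats each UTF-16 code unit low byte first, which is how the
// code units appear in a memory dump on a little-endian machine.
std::string dump_utf16le(const std::u16string& s) {
    std::string out;
    char tmp[8];
    for (char16_t u : s) {
        std::snprintf(tmp, sizeof tmp, "%02x %02x",
                      static_cast<unsigned>(u & 0xFF),
                      static_cast<unsigned>((u >> 8) & 0xFF));
        if (!out.empty()) out += ' ';
        out += tmp;
    }
    return out;
}
```

Calling `dump_utf16le(u"\u30C6\u30B9\u30C8")` (the three katakana of テスト) gives `c6 30 b9 30 c8 30`, matching the tail of the `*bar` dumps. The narrow-string bytes depend on the compiler's execution character set, so only the ASCII part is predictable across machines.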

Using encoded strings in W/A versions of Win32 APIs

So, using the following code:

char *foo = "C:\\Temp\\テスト\\テa.txt";
wchar_t *bar = L"C:\\Temp\\テスト\\テw.txt";

CreateFileA(foo, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
CreateFileW(bar, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

Compiling with Unicode

Result: Both files are created

Compiling with Multi byte character set

Result: Both files are created

Compiling with Not set

Result: Both files are created

Conclusion: Both the A and W versions of the API expect the same encoding regardless of the character set chosen. From this, perhaps we can assume that all the Character Set option does is switch between the two versions of the API. So the A version always expects strings in the encoding of the current code page and the W version always expects UTF-16 or UCS-2.
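The switching itself is plain preprocessor work. The sketch below imitates the pattern the Windows headers use for names such as `CreateFile`; `DemoOpenA`, `DemoOpenW`, `DemoOpen`, `DEMO_TEXT` and `which_variant` are hypothetical stand-ins so that it compiles without `<windows.h>`. As far as I know, the Unicode setting defines `UNICODE` and `_UNICODE`, Multi byte defines `_MBCS`, and Not Set defines none of them, so the last two both resolve to the A variant.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for an A/W pair such as CreateFileA / CreateFileW.
std::string DemoOpenA(const char*)    { return "A"; }
std::string DemoOpenW(const wchar_t*) { return "W"; }

// Same shape as the header mapping: the unsuffixed name is only a macro.
#ifdef UNICODE
#define DemoOpen DemoOpenW
#define DEMO_TEXT(s) L##s   // wide literal, like TEXT("...") in a Unicode build
#else
#define DemoOpen DemoOpenA
#define DEMO_TEXT(s) s      // narrow literal; _MBCS plays no part here
#endif

// Reports which variant the unsuffixed name resolved to in this build.
std::string which_variant() {
    return DemoOpen(DEMO_TEXT("C:\\Temp\\a.txt"));
}
```

Calling the suffixed functions directly, as in the test above, bypasses the macro entirely, which is consistent with getting identical results under all three settings.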

Opening files using W and A Win32 APIs

So using the following code:

char filea[MAX_PATH] = {0};
OPENFILENAMEA ofna = {0};
ofna.lStructSize = sizeof ( ofna );
ofna.hwndOwner = NULL  ;
ofna.lpstrFile = filea ;
ofna.nMaxFile = MAX_PATH;
ofna.lpstrFilter = "All\0*.*\0Text\0*.TXT\0";
ofna.nFilterIndex =1;
ofna.lpstrFileTitle = NULL ;
ofna.nMaxFileTitle = 0 ;
ofna.lpstrInitialDir=NULL ;
ofna.Flags = OFN_PATHMUSTEXIST|OFN_FILEMUSTEXIST ;  

wchar_t filew[MAX_PATH] = {0};
OPENFILENAMEW ofnw = {0};
ofnw.lStructSize = sizeof ( ofnw );
ofnw.hwndOwner = NULL  ;
ofnw.lpstrFile = filew ;
ofnw.nMaxFile = MAX_PATH;
ofnw.lpstrFilter = L"All\0*.*\0Text\0*.TXT\0";
ofnw.nFilterIndex =1;
ofnw.lpstrFileTitle = NULL;
ofnw.nMaxFileTitle = 0 ;
ofnw.lpstrInitialDir=NULL ;
ofnw.Flags = OFN_PATHMUSTEXIST|OFN_FILEMUSTEXIST ;

GetOpenFileNameA(&ofna);
GetOpenFileNameW(&ofnw);

and selecting either:

  • C:\Temp\テスト\テopena.txt
  • C:\Temp\テスト\テopenw.txt

Yields:

When compiled with Unicode

*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-Jis (Code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2

When compiled with Multi byte character set

*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-Jis (Code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2

When compiled with Not Set

*filea = 43 3a 5c 54 65 6d 70 5c 83 65 83 58 83 67 5c 83 65 6f 70 65 6e 61 2e 74 78 74 == Shift-Jis (Code page 932)
*filew = 43 00 3a 00 5c 00 54 00 65 00 6d 00 70 00 5c 00 c6 30 b9 30 c8 30 5c 00 c6 30 6f 00 70 00 65 00 6e 00 77 00 2e 00 74 00 78 00 74 00 == UTF-16 or UCS-2

Conclusion: Again, the Character Set setting doesn't have a bearing on the behaviour of the Win32 API. The A version always seems to return a string with the encoding of the active code page and the W one always returns UTF-16 or UCS-2. I can actually see this explained a bit in this great answer: https://stackoverflow.com/a/3299860/187100.

Ultimate Conclusion

Hans appears to be correct when he says that the define doesn't really have any magic to it, beyond changing the Win32 APIs to use either W or A. Therefore, I can't really see any difference between Not Set and Multi byte character set.

Andy
2 Answers


No, that's not really the way it works. The only thing that happens is that the macro gets defined, it doesn't otherwise have a magic effect on the compiler. It is very rare to actually write code that uses #ifdef _MBCS to test this macro.

You almost always leave it up to a helper function to make the conversion, like WideCharToMultiByte(), OLE2A() or wcstombs(). These are conversion functions that always consider multi-byte encodings, as guided by the code page. _MBCS is an historical accident, relevant only 25+ years ago when multi-byte encodings were not common yet. Much like using a non-Unicode encoding is a historical artifact these days as well.
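As a rough illustration of leaving the conversion to a helper, the sketch below wraps the standard C function `wcstombs()`, which converts according to the current C locale rather than any project setting. The wrapper name `narrow` and its bool-plus-out-parameter shape are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Converts a wide string to the multibyte encoding of the current C locale.
// Returns false if some character cannot be represented in that encoding.
bool narrow(const std::wstring& wide, std::string& out) {
    // Worst case: every wide character needs MB_CUR_MAX bytes, plus the terminator.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        return false;  // wcstombs signals an unconvertible character this way
    }
    out.assign(buf.data(), n);
    return true;
}
```

In the default "C" locale only ASCII is guaranteed to convert. To get Shift-JIS output for Japanese text, the program would first have to select a matching locale with `setlocale()`, and the locale names accepted there are platform specific.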

Hans Passant
  • So if I understand correctly, if I define a hard coded string, say char *foo = "テスト", how the string pointed to by foo is encoded is not defined by the character set setting? Maybe by the encoding of the code file containing that line? (I'm trying to test these theories out at the moment) – Andy Jul 19 '13 at 11:45
  • That's going to force your text editor to choose an appropriate encoding for the source code file. In itself a source of accidents. If it picked a Unicode encoding, utf-8 is common, you are liable to get your compiler to get sulky about it. C4566 on my machine. Only ever consider writing it like this if you live in Japan and don't consider moving anytime soon. – Hans Passant Jul 19 '13 at 12:00
  • Ok, so it sounds like I'm understanding this a bit better now. The defines don't really do much, the code page is set on the machine regardless of how an application is compiled, and the defines just change the Win32 APIs. Based on whether it's W or A, I guess it'd return code page (multibyte or single byte char set) encoded stuff (A) or UTF-16 (W)? – Andy Jul 19 '13 at 12:08

In the reference it is stated that:

By definition, the ASCII character set is a subset of all multibyte-character sets. In many multibyte character sets, each character in the range 0x00 – 0x7F is identical to the character that has the same value in the ASCII character set. For example, in both ASCII and MBCS character strings, the 1-byte NULL character ('\0') has value 0x00 and indicates the terminating null character.

As you guessed, with _MBCS enabled Visual Studio still supports the single-byte ASCII character set.
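To make the mix of single-byte and double-byte characters concrete, here is a small sketch that counts characters in a code page 932 (Shift-JIS) byte string. It assumes the usual CP932 lead-byte ranges 0x81 to 0x9F and 0xE0 to 0xFC; the function names are hypothetical. On Windows the CRT's `_mbslen` does this kind of counting for the active code page.

```cpp
#include <cassert>
#include <cstddef>

// Lead-byte ranges assumed for code page 932: 0x81-0x9F and 0xE0-0xFC.
bool is_cp932_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Counts characters, not bytes: ASCII bytes count as one character each,
// a lead byte together with its trail byte counts as one character.
std::size_t count_cp932_chars(const unsigned char* s, std::size_t len) {
    assert(s != nullptr || len == 0);
    std::size_t chars = 0;
    for (std::size_t i = 0; i < len; ++chars) {
        // A truncated pair at the end is counted as a single character.
        i += (is_cp932_lead_byte(s[i]) && i + 1 < len) ? 2 : 1;
    }
    return chars;
}
```

Fed the 16 bytes of the "Jap text: テスト" dump from the question, this returns 13: ten ASCII characters plus three double-byte katakana. Note that the trail bytes 0x65, 0x58 and 0x67 fall inside the ASCII range, which is why byte-wise string functions can misread such text.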

In a second reference, single-byte character sets seem to be supported even if we enable _MBCS:

MBCS/Unicode portability: Using the Tchar.h header file, you can build single-byte, MBCS, and Unicode applications from the same sources. Tchar.h defines macros prefixed with _tcs , which map to str, _mbs, or wcs functions, as appropriate. To build MBCS, define the symbol _MBCS. To build Unicode, define the symbol _UNICODE. By default, _MBCS is defined for MFC applications. For more information, see Generic-Text Mappings in Tchar.h.

fatihk
  • But by not using `_MBCS` aren't the APIs using the single byte character set code page based on the locale, such as those defined at: http://msdn.microsoft.com/en-gb/goglobal/bb964654.aspx? Each of those starts with the ASCII range, but they go on to define other foreign characters too. – Andy Jul 19 '13 at 10:06
  • @Andy, Yes, ASCII is a 7 bit character set with 128 characters while single byte(8 bit) locale encodings may encode 256 characters. – fatihk Jul 19 '13 at 10:10
  • yup, so the questions still remain: if MBCS is defined, are single byte character set code pages excluded (and therefore, say, Thai characters)? And if I compile without MBCS, I'm guessing the application wouldn't be able to handle far eastern characters since it's restricted to single byte character set code pages? – Andy Jul 19 '13 at 10:16
  • @Andy, according to the second reference, it seems to be supported – fatihk Jul 19 '13 at 10:29