
My question seems to have confused folks. Here's something concrete:

Our code does the following:

FILE * fout = _tfsopen(_T("丸穴種類.txt"), _T("w"), _SH_DENYNO);
_fputts(W2T(L"刃物種類\n"), fout);
fclose(fout);

Under MBCS build target, the above produces a properly encoded file for code page 932 (assuming that 932 was the system default code page when this was run).

Under UNICODE build target, the above produces a garbage file full of ????.

I want to define a symbol, or use a compiler switch, or include a special header, or link to a given library, to make the above continue to work when the build target is UNICODE without changing the source code.

Here's the question as it used to exist:

FILE* streams can be opened in t(ranslated) or b(inary) modes. Desktop applications can be compiled for UNICODE or MBCS (under Windows).

If my application is compiled for MBCS, then writing MBCS strings to a "wt" stream results in a well-formed text file containing MBCS text for the system code page (i.e. the code page "for non Unicode software").

Because our software generally uses the _t versions of most string & stream functions, in MBCS builds output is handled primarily by puts(pszMBString) or something similar (putc, etc.). Since pszMBString is already in the system code page (e.g. 932 when running on a Japanese machine), the string is written out verbatim (although line terminators are massaged automatically by puts and gets).

However, if my application is compiled for UNICODE, then writing MBCS strings to a "wt" stream results in garbage (lots of "?????" characters). Here I convert the UNICODE string to the system's default code page and then write that to the stream using, for example, fwrite(pszNarrow, 1, length, stream).


I can open my streams in binary mode, in which case I'll get the correct MBCS text... but the line terminators will no longer be PC-style CR+LF; instead they will be UNIX-style LF only. This is because in binary (non-translated) mode, the file stream doesn't perform the LF -> CR+LF translation.


But what I really need, is to be able to produce the exact same files I used to be able to produce when compiling for MBCS: correct line terminators and MBCS text files using the system's code page.

Obviously I can manually adjust the line terminators myself and use binary streams. However, this is a very invasive approach, as I now have to find every bit of code throughout the system that writes text files, and alter it so that it does all of this correctly. What blows my mind, is that UNICODE target is stupider / less capable than the MBCS target we used to use! Surely there is a way to toggle the C library to say "output narrow strings as-is but handle line terminators properly, exactly as you'd do in MBCS builds"?!

Mordachai
  • What is your plan if you need to write a Unicode character that isn't representable in the current MBCS encoding? If you absolutely need to stick with MBCS, why bother compiling with `UNICODE` (and presumably `_UNICODE`) at all? Or why not call the "ANSI" versions of functions directly? Personally I'd switch to UTF-8 as a data file format and provide a migration tool to convert existing data. – jamesdlin Aug 21 '13 at 21:06
  • Please provide a short code example. – dalle Aug 21 '13 at 21:23
  • Are you using the Windows-specific _t-function-macros, such as `_tfopen` and `_fputts` or the `fopen`/`_wfopen` and `fputs`/`fputws` functions? – dalle Aug 21 '13 at 21:38
  • What method are you using to convert your wide strings to your current code page? – dalle Aug 21 '13 at 21:43
  • Do not write MBCS files; use UTF-8 files instead. See http://utf8everywhere.org – Pavel Radzivilovsky Aug 21 '13 at 23:25
  • @jamesdlin "What is your plan if you need to write a Unicode character that isn't representable in the current MBCS encoding" -- our software lived within the system code page for all versions prior. It can live within those confines for one more release cycle, until I change necessary parts for UNICODE file i/O – Mordachai Aug 22 '13 at 19:22
  • @dalle - using the _t functions, and typically using CStringA(widestring) to auto-convert to MBCS (system code page). We `#define _CONVERSION_DONT_USE_THREAD_LOCALE` in order to ensure we use the system code page and not something else – Mordachai Aug 22 '13 at 19:23
  • @Pavel - that's a cool way to go about things. We, however, already have legacy installs where the files are encoded using the local system code page, and must be able to read & write such files. – Mordachai Aug 22 '13 at 19:29
  • Also see http://stackoverflow.com/questions/1509277/why-does-wide-file-stream-in-c-narrow-written-data-by-default – dalle Aug 25 '13 at 15:13

3 Answers


Sadly, this is a huge topic that deserves a small book devoted to it. And that book would basically need a specialized chapter for every target platform one wished to build for (Linux, Windows [flavor], Mac, etc.).

My answer only covers Windows desktop applications, compiled as C++ with or without MFC. Please note: this pertains to reading and writing MBCS (narrow) files from a UNICODE build using the system default code page (i.e. the code page for non-Unicode software). If you want to read and write Unicode files from a UNICODE build, you must open the files in binary mode, and you must handle the BOM and line-ending conversions manually. On input, skip the BOM (if any), convert the external encoding to Windows Unicode (i.e. UTF-16LE), and convert any CR+LF sequences to LF only. On output, write the BOM (if any), convert from UTF-16LE to whatever target encoding you want, and convert LF to CR+LF sequences so the result is a properly formatted PC text file.

BEWARE of MS's standard C library's puts and gets and fwrite and so on: if the stream was opened in text/translated mode, they will convert any 0x0A to a 0x0D 0x0A sequence on write, and the reverse on read, regardless of whether you're reading or writing a single byte, a wide character, or a stream of random binary data. These functions don't care; they all boil down to doing blind byte conversions in text/translated mode!

Also be aware that many Windows API functions use CP_ACP internally, with no external control over their behavior (e.g. WritePrivateProfileString()). Hence one might want to ensure that all libraries operate with the same character locale, CP_ACP and not some other one: since you can't control those functions' behavior, you're forced to conform to their choice or not use them at all.

If using MFC, one needs to:

// force CP_ACP *not* CP_THREAD_ACP for MFC CString auto-converters!!!
// this makes MFC's CString and CStdioFile and other interfaces use the
// system default code page, instead of the thread default code page (which is normally "c")
#define _CONVERSION_DONT_USE_THREAD_LOCALE  

For C++ and C libraries, one must tell the libraries to use the system code page:

// force C++ and C libraries based on setlocale() to use system locale for narrow strings
// (this automatically calls setlocale() which makes the C library do the same thing as C++ std lib)
// we only change the LC_CTYPE, not collation or date/time formatting
std::locale::global(std::locale(str(boost::format(".%||") % GetACP()).c_str(), LC_CTYPE));

I do the #define in all of my precompiled headers, before including any other headers. I set the global locale in main (or its moral equivalent), once for the entire program (you may need to call this for every thread that is going to do I/O or string conversions).

The build target is UNICODE, and for most of our I/O, we use explicit string conversions before outputting via CStringA(my_wide_string).

One other thing to be aware of: there are two different sets of multibyte functions in the C standard library under VC++ -- those which use the thread's locale for their operations, and another set which uses the code page set by _setmbcp() (which you can query via _getmbcp()). The latter is an actual code page (not a locale) used for all narrow-string interpretation (note: it is always initialized to CP_ACP, i.e. GetACP(), by the VC++ startup code).

Useful reference materials:
- the-secret-family-split-in-windows-code-page-functions
- Sorting it all out (explains that there are four different locales in effect in Windows)
- MS offers some functions that allow you to set the encoding to use directly, but I didn't explore them
- An important note about a change to MFC that caused it to no longer respect CP_ACP, but rather CP_THREAD_ACP by default starting in MFC 7.0
- Exploration of why console apps in Windows are extreme FAIL when it comes to Unicode I/O
- MFC/ATL narrow/wide string conversion macros (which I don't use, but you may find useful)
- Byte order marker, which you need to write out for Unicode files of any encoding to be understood by other Windows software

Mordachai
  • Whomever down voted this needs to say why. You're welcome to your opinion, but you at least need to say why you think this is a bad answer, otherwise you're leaving myself and any future readers with a negative sense for something that is working correctly in our projects (and took me a great deal of time / effort to sleuth). – Mordachai Feb 07 '14 at 19:00
  • I would suggest rewriting the code to not use boost, as not everyone would have it, but I get the concept: it's basically doing a setlocale(LC_CTYPE, ".codepage"); where codepage is the current default system code page of the OS. The second thing I would mention is that _CONVERSION_DONT_USE_THREAD_LOCALE is not going to work if you use MFC DLLs, since it's hardbaked into the prebuilt binaries. So SetThreadLocale(LOCALE_SYSTEM_DEFAULT) becomes necessary in certain cases. – Ted. Mar 01 '17 at 12:23

The C library has support for both narrow (char) and wide (wchar_t) strings. In Windows these two types of strings are called MBCS (or ANSI) and Unicode respectively.

It is fully possible to use the narrow functions even though you have defined _UNICODE. The following code should produce the same output, regardless if _UNICODE is defined or not:

FILE* f = fopen("foo.txt", "wt");
fputs("foo\nbar\n", f);
fclose(f);

In your question you wrote: "I convert the UNICODE to the system's default code page and write that to the stream". This leads me to believe that your wide string contains characters that cannot be converted to the current code page, each of which is therefore replaced with a question mark.

Perhaps you could use some encoding other than the current code page. I recommend using the UTF-8 encoding wherever possible.

Update: Testing your example code on a Windows machine running on code page 1252, the call to _fputts returns -1, indicating an error. errno was set to EILSEQ, which means "Illegal byte sequence". The MSDN documentation for fopen states that:

When a Unicode stream-I/O function operates in text mode (the default), the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).

This is key information for this error. wctomb will use the locale for the C standard library. By explicitly setting the locale for the C standard library to code page 932 (Shift JIS), the code ran perfectly and the output was correctly encoded in Shift JIS in the output file.

#include <locale.h>
#include <share.h>
#include <stdio.h>

int main()
{
    setlocale(LC_ALL, ".932");
    FILE * fout = _wfsopen(L"丸穴種類.txt", L"w", _SH_DENYNO);
    fputws(L"刃物種類\n", fout);
    fclose(fout);
}

An alternative (and perhaps preferable) solution to this would be to handle the conversions yourself before calling the narrow string functions of the C standard library.

dalle
  • If a string did contain a non-encodable character, then yes, I would expect ? to result. Not the case here - I'm looking at characters that clearly can be encoded in Shift-JIS, came from a Shift-JIS file (and were correctly loaded and converted into UNICODE). But when trying to write them back out, they become ??? – Mordachai Aug 22 '13 at 19:26
  • I appreciate your answer, @dalle. It's basically correct. The big issue for us is that the above changes other aspects of the locale, not just the character encoding, to Japan (code page 932). We need to respect the user's desired settings for collation and date/time formatting etc., while also getting narrow file I/O "correct" (i.e. the same as it was in our MBCS versions). See my answer for more information. – Mordachai Aug 26 '13 at 16:38

When you compile for UNICODE, the C++ library knows nothing about MBCS. If you open the file for text output, it will attempt to treat the buffers you pass to it as UNICODE buffers.

Also, MBCS is a variable-length encoding. To parse it, the C++ library needs to iterate over characters, which is of course impossible when it knows nothing about MBCS. Hence it's impossible to "just handle line terminators correctly".

I would suggest that you either prepare your strings beforehand, or make your own function that writes string to file. Not sure if writing characters one by one would be efficient (measurements required), but if not, you can handle strings piecewise, putting everything that doesn't contain \n in one go.

Codeguard
  • The C library (under VC++ 2012) has a number of mechanisms for handling MBCS and UNICODE, including `fputs` vs. `fputws` for example. What I am asking for may not be supported, but I think your answer doesn't show a deep understanding of the C library under Windows, and I am looking for feedback from someone who does. – Mordachai Aug 21 '13 at 15:04
  • @Mordachai: If you think `fputs` handles `MBCS`, then you don't have a deep understanding of MBCS. I believe this is the source of your problem/question as well. – Mooing Duck Aug 21 '13 at 22:07
  • If I thought that, I'd be an idiot (since the docs clearly state that fputs doesn't handle MBCS nor UNICODE. However, fputws DOES, as also stated in the docs). – Mordachai Aug 22 '13 at 14:23
  • fputs does correctly output an MBCS string, if the MBCS string encoding matches the system code page, which for any MBCS application it would. – Mordachai Aug 22 '13 at 19:27