Capture spawned process stdout as unicode

Question

In my C++/WinAPI code, I want to run some commands and capture their output. To test non-ASCII output, I renamed my network connection to Ethérnét אבג БбГгДд and ran ipconfig. When run in a command prompt, the output comes out correctly (visible when using a supporting font like Courier New):

C:\>ipconfig
Windows IP Configuration

Ethernet adapter Ethérnét אבג БбГгДд:
(...)

I tried to redirect the output to a pipe, following the example in this answer. But the byte array returned from ReadFile() is not unicode - it's encoded in CP_OEMCP (CP437 in my case), and so the Hebrew and Russian characters come out as '?'s. Since the characters are already lost, no further handling can restore them.

Obviously it's possible, since cmd in a console window does it. How can I do it?

ReadFile returns bytes, it has no idea what Unicode is. Show how you're handling its buffer. — Alex K., Jan 03 '17 at 10:22
I've inspected the returned bytes from the debugger, and they're text encoded in CP437, with the Hebrew/Russian characters replaced with actual '?'s. Since the chars are lost, no handling would restore that. I wanted to know how cmd.exe (or Console window?) does manage to capture those chars correctly. — Jonathan, Jan 03 '17 at 11:07
So convert it to Unicode with `MultiByteToWideChar(CP_OEMCP, ` - the characters are not lost — RbMm, Jan 03 '17 at 15:34
That's what I do now. However, since CP_OEMCP can't encode all characters - like the Hebrew+Russian in my example - they appear as actual '?'s, and the conversion can't recover them, since they are lost. — Jonathan, Jan 03 '17 at 15:38
`CP_OEMCP can't encode all characters` - are you sure about this? I think you are wrong here — RbMm, Jan 03 '17 at 15:52
@RbMm See the [list of characters in CP437](https://msdn.microsoft.com/en-us/library/cc195060.aspx). No Hebrew or Russian characters here. `MultiByteToWideChar` will not magically restore any characters not on this list. — roeland, Jan 03 '17 at 21:37
@roeland - did you try the code I pasted? Try using `MultiByteToWideChar(CP_OEMCP`. Which code do you use? Again, nothing is lost in multi-byte to wide-char conversions — RbMm, Jan 03 '17 at 21:41
`ipconfig.exe` uses `WriteConsoleW` for output to the console - as a result it always prints correctly in any language and does not depend on the current code page. If an app uses the `A` functions or writes to a file as multi-byte, there will be a problem when it tries to print characters that do not exist in the code page in use — RbMm, Jan 03 '17 at 22:12
@RbMm If you receive data encoded in the OEM character set, the characters are already lost, and your program can do nothing to restore them. Eg. the child process outputs `"αβ"`, which is then reduced to the OEM character set (probably something like `"ab"`), and only then passed on to your program. — roeland, Jan 03 '17 at 23:06
@roeland - as I understand it now, after re-reading the OP's question: he uses `CP437`, and with it `WideCharToMultiByte(CP_OEMCP)` really does lose data for Hebrew and Russian characters. `ipconfig.exe`, however, uses the `UNICODE` functions to write to the console - as a result the text is displayed correctly. — RbMm, Jan 03 '17 at 23:13
@RbMm: It's pretty obvious that a character set containing only 256 characters cannot be used to encode all 100.000+ Unicode characters. — MSalters, Jan 04 '17 at 00:08
@MSalters - no, because WideCharToMultiByte maps 1 Unicode character to several (usually 2) multi-byte characters for non-`en` text - so we have not 256, but 256*256 — RbMm, Jan 04 '17 at 00:45
More exactly: when we use CP_ACP or CP_OEMCP we have a one-to-one mapping by length from Unicode to multi-byte, but in the case of CP_UTF8 one non-'en' wchar is usually converted to 2 chars — RbMm, Jan 04 '17 at 00:51
@RbMm: That supposes `CP_ACP` and `CP_OEM` are actually multi-byte. Possible, but rare, and when CP_OEM is the common CP437 it's single-byte. — MSalters, Jan 04 '17 at 01:02
@MSalters - `CP_ACP` and `CP_OEM` translate Unicode chars to the selected code page, using one-to-one symbol conversion. If, say, we use the `Hebrew` page, we can translate (without losing data) Hebrew and English chars, but not Russian or another language — RbMm, Jan 04 '17 at 01:35

Harry Johnston · Accepted Answer · 2019-05-05T01:58:13.357

It would seem that ipconfig produces Unicode output when it detects that the output device is the console, and ANSI output otherwise. This is likely to be a backwards-compatibility measure.

Most other built-in command-line tools are likely to either be ANSI-only or to behave in the same way as ipconfig, for the same reason. In Windows, command-line tools are meant, well, for use on the command line; programmers are discouraged from shelling out to them and parsing the output. Instead, you should use the corresponding APIs.

If you know which language you are expecting, you might be able to choose a code page that will preserve the content.

Added by @Jonathan: Undocumented: Turns out you can control the encoding of built-in commands using the environment variable OutputEncoding. I tested with ipconfig, but presumably it works with other built-in tools as well:

> for %e in ("" Unicode Ansi UTF8) do (set OutputEncoding=%~e& ipconfig >ipconfig-%~e.txt)
> (set OutputEncoding=  & ipconfig  1>ipconfig-.txt )
> (set OutputEncoding=Unicode  & ipconfig  1>ipconfig-Unicode.txt )
> (set OutputEncoding=Ansi  & ipconfig  1>ipconfig-Ansi.txt )
> (set OutputEncoding=UTF8  & ipconfig  1>ipconfig-UTF8.txt )

And indeed, the ipconfig-*.txt files are encoded as expected! Note that this is undocumented, but it does work for me.
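From C++, the same effect can probably be had by setting the variable in the parent before spawning the child. The following is a minimal sketch, not a tested recipe: `SetChildOutputEncoding` is a hypothetical helper name, and the behaviour of `OutputEncoding` itself remains undocumented as noted above.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: sets OutputEncoding in the current process so that a
// child created afterwards inherits it.
// `value` is one of the strings tried above: "Unicode", "Ansi" or "UTF8".
bool SetChildOutputEncoding(const std::string& value) {
    assert(!value.empty());
#ifdef _WIN32
    return SetEnvironmentVariableA("OutputEncoding", value.c_str()) != FALSE;
#else
    // Non-Windows branch exists only so the sketch builds elsewhere;
    // the variable has no effect outside Windows.
    return setenv("OutputEncoding", value.c_str(), 1) == 0;
#endif
}
```

Call it before `CreateProcessW` and pass `NULL` for `lpEnvironment`, so that the child inherits the parent's environment, including the new variable.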

Addendum: as of Windows 10 v1809, another alternative is to create a pseudoconsole.

edited May 05 '19 at 01:58

answered Jan 04 '17 at 00:30

Harry Johnston


That explains it. I looked into `ipconfig`, and added my findings to the answer. I wish we could set CP_OEMCP to CP_UTF8 (and CP_ACP too)... – Jonathan Jan 04 '17 at 12:24
@Jonathan, the code fragment you posted is only reached in the case where output is to the console, it isn't relevant to the case where output has been redirected to a pipe. It is however interesting that it is the C runtime library that is responsible for the conversion from UTF-16 to the current locale. From what I can see in the CRT source it uses `wcstomb_s` to do so, though I'm looking at the Visual Studio CRT, not quite the same as the one built into Windows. Unfortunately there doesn't seem to be any way to make the CRT generate UTF-8. – Harry Johnston Jan 04 '17 at 23:02

Indeed, my code was irrelevant. However, I discovered that the conversion happens inside `ipconfig.exe` - and you can control the codepage using the undocumented `OutputEncoding` env variable. I'll add a sample to your answer. – Jonathan Jan 05 '17 at 09:23
Neat find! (It may be worth posting as a separate answer, I for one would upvote it.) It is curious that the string `OutputEncoding` doesn't appear in the Visual Studio 2010 CRT source code, or in `msvcrt.dll` for that matter, but does appear in `shell32.dll` which makes me think it may be something the operating system is doing rather than the CRT. The details don't really matter though. – Harry Johnston Jan 05 '17 at 21:56
Correct - `OutputEncoding` happens in `ipconfig.exe`, not msvcrt - you can see it using SysInternals strings. It appears to apply only to some tools - `netstat.exe`, but not `robocopy.exe`. – Jonathan Jan 07 '17 at 13:12

RbMm · Answer 2 · 2017-01-03T22:35:40.620

A console application can write its output in several ways:

If the handle is a console handle, it can call WriteConsoleW with text that is already in Unicode.
If it wants to use WriteConsoleA or WriteFile on a console handle, it must first convert the Unicode text to multi-byte with WideCharToMultiByte, using CodePage = GetConsoleOutputCP().
If the text is not Unicode to begin with (say UTF-8 or ANSI), it must first be converted to Unicode with MultiByteToWideChar (with CP_UTF8 or CP_ACP) and then converted again to multi-byte with WideCharToMultiByte(GetConsoleOutputCP(), ..).

By default, GetConsoleOutputCP() usually returns the same value as GetOEMCP(), so passing it to MultiByteToWideChar and WideCharToMultiByte has the same effect as passing CP_OEMCP (that constant is translated to GetOEMCP()).

When the output handle is redirected to a file, only WriteFile can be used. The application can, however, write data to the file in any format: Unicode, ANSI (CP_ACP), UTF-8 (CP_UTF8) and so on. Which format is used depends on the specific application, and you cannot fully control it. Usually you will receive multi-byte output in CP_OEMCP encoding. You then need to decide how to process it: most likely you will first convert it to Unicode and work with the Unicode form. If you need ANSI, you will need one more conversion.

For example, if you pass pipe output in CP_OEMCP encoding straight to OutputDebugStringA, you get garbled (unreadable) output for non-English text. After two conversions, CP_OEMCP -> Unicode -> CP_ACP, OutputDebugStringA displays the text correctly. But since OutputDebugStringW exists, converting to Unicode alone is enough here.

Some applications also have special options to control the format of output written to a file. For example, ipconfig.exe looks for the "OutputEncoding" environment variable and, depending on its string value ("Unicode", "Ansi", "UTF-8"), produces different output. By default (if the variable does not exist or has an unknown value), CP_OEMCP is used.
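If the child is asked for "Unicode" output, the bytes arriving from the pipe are presumably UTF-16LE (an assumption, based on what "Unicode" usually means on Windows), and a single ReadFile call may end in the middle of a code unit. A minimal sketch of reassembling code units across reads follows; `Utf16LeAssembler` is a hypothetical helper, and a leading byte-order mark, if the tool emits one, is not stripped here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: turns raw pipe chunks into UTF-16 code units,
// assuming the child writes UTF-16LE. An odd trailing byte is held back
// until the next chunk arrives.
class Utf16LeAssembler {
public:
    std::u16string Feed(const unsigned char* data, std::size_t size) {
        assert(data != nullptr || size == 0);
        std::u16string out;
        out.reserve((size + 1) / 2);
        std::size_t i = 0;
        if (hasCarry_ && size > 0) {
            // Complete the code unit whose low byte came in the last chunk.
            out.push_back(Combine(carry_, data[0]));
            hasCarry_ = false;
            i = 1;
        }
        for (; i + 1 < size; i += 2) {
            out.push_back(Combine(data[i], data[i + 1]));
        }
        if (i < size) {  // one byte left over: keep it for the next call
            carry_ = data[i];
            hasCarry_ = true;
        }
        return out;
    }

    bool HasPendingByte() const { return hasCarry_; }

private:
    // Little-endian: the low byte comes first.
    static char16_t Combine(unsigned char lo, unsigned char hi) {
        return static_cast<char16_t>(lo | (hi << 8));
    }

    unsigned char carry_ = 0;
    bool hasCarry_ = false;
};
```

On Windows, where wchar_t is 16 bits wide, the resulting code units can be copied into a std::wstring and passed to OutputDebugStringW without any code-page conversion.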

Example of a pipe read procedure, assuming the input data is in CP_OEMCP encoding:

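#include <windows.h>
#include <malloc.h> // alloca

// Defined here only so the snippet is self-contained:
// true routes output through OutputDebugStringA, false through OutputDebugStringW.
bool g_bUseAnsi = false;

// The debugger may truncate a very large buffer, so print it in small chunks.
// The buffer must have room for one extra element past len, because a
// temporary terminator is written there for the last chunk.
// Declared before OnRead so that the template is visible at the call sites.
template<typename T> void DoPrint(T* p, ULONG len, void (WINAPI* fnOutput)(const T*))
{
    while (len)
    {
        ULONG cb = len < 256 ? len : 256;
        T* q = p + cb;
        T c = *q;   // save the element that the terminator overwrites
        *q = 0;
        fnOutput(p);
        *q = c;     // restore it before moving on
        p = q;
        len -= cb;
    }
}

void OnRead(PVOID buf, ULONG cbTransferred)
{
    if (!cbTransferred) return;

    // First call only queries the required length in WCHARs.
    int len = MultiByteToWideChar(CP_OEMCP, 0, (PCSTR)buf, cbTransferred, 0, 0);
    if (!len) return;

    PWSTR pwz = (PWSTR)alloca((1 + len) * sizeof(WCHAR));
    len = MultiByteToWideChar(CP_OEMCP, 0, (PCSTR)buf, cbTransferred, pwz, len);
    if (!len) return;

    if (!g_bUseAnsi)
    {
        DoPrint(pwz, len, OutputDebugStringW);
        return;
    }

    // Second conversion, Unicode -> CP_ACP, needed only for the A function.
    int cb = WideCharToMultiByte(CP_ACP, 0, pwz, len, 0, 0, 0, 0);
    if (!cb) return;

    PSTR psz = (PSTR)alloca(cb + 1);
    cb = WideCharToMultiByte(CP_ACP, 0, pwz, len, psz, cb, 0, 0);
    if (cb) DoPrint(psz, cb, OutputDebugStringA);
}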

About your specific case: ipconfig.exe uses WriteConsoleW for output to the console. As a result it does not depend on the current system locale and can display multi-language text correctly. Other tools, such as route.exe, use WriteFile for output (both to the console and to a file) and first convert the Unicode text to multi-byte with WideCharToMultiByte(CP_OEMCP, ..). This causes problems when they try to display characters that do not exist in the CP_OEMCP code page (current system locale). If you have CP437, Hebrew and Russian characters are lost in the Unicode -> CP_OEMCP conversion, so the only fix is to output Unicode directly, both to the console and to a file. Whether that is possible depends on the specific application. For route.exe, say, it is not possible. For ipconfig.exe it is possible, because it always writes to the console in Unicode, and it can also write to a file in Unicode or UTF-8 if you set "OutputEncoding" to "Unicode" or "UTF-8".

This fails to account for multi-byte characters that straddle packages. If [IsDBCSLeadByte](https://msdn.microsoft.com/en-us/library/windows/desktop/dd318664.aspx) is `TRUE` for the final code unit, the conversion breaks both this block as well as the following block of bytes. — IInspectable, Feb 20 '17 at 10:02
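One way to handle the chunk-boundary problem raised in this comment is to hold back the bytes of an incomplete trailing character and prepend them to the next chunk. The sketch below does this for the UTF-8 case only (for example, a child started with OutputEncoding set for UTF-8 output). `Utf8Chunker` and `Feed` are hypothetical names, not part of any Windows API. A DBCS code page would need a lead-byte check such as IsDBCSLeadByteEx instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: accumulates pipe chunks and returns only complete
// UTF-8 sequences; an incomplete tail is held back until the next chunk.
class Utf8Chunker {
public:
    // Append a raw chunk and return the longest prefix that ends on a
    // character boundary.
    std::string Feed(const char* data, std::size_t size) {
        assert(data != nullptr || size == 0);
        pending_.append(data, size);
        std::size_t cut = CompletePrefixLength(pending_);
        std::string ready = pending_.substr(0, cut);
        pending_.erase(0, cut);
        return ready;
    }

    // Number of bytes still waiting for the rest of their sequence.
    std::size_t PendingSize() const { return pending_.size(); }

private:
    static std::size_t CompletePrefixLength(const std::string& s) {
        const std::size_t n = s.size();
        std::size_t i = n;
        std::size_t back = 0;
        // Walk back over at most 3 trailing continuation bytes (10xxxxxx).
        while (i > 0 && back < 3 &&
               (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
            --i;
            ++back;
        }
        if (i == 0) return n;  // no lead byte found: pass through unchanged
        const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
        std::size_t need;
        if (lead < 0x80) need = 1;
        else if ((lead & 0xE0) == 0xC0) need = 2;
        else if ((lead & 0xF0) == 0xE0) need = 3;
        else if ((lead & 0xF8) == 0xF0) need = 4;
        else return n;  // malformed input: let the converter deal with it
        const std::size_t have = n - (i - 1);
        // Cut before the lead byte if its sequence is still incomplete.
        return have < need ? i - 1 : n;
    }

    std::string pending_;
};
```

Each string returned by `Feed` ends on a character boundary, so it can be passed to `MultiByteToWideChar(CP_UTF8, ...)` without splitting a character between two conversions.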
