
I'm trying to understand some behavior I'm seeing.

I have this C++ program:

// Outputter.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>


int main()
{
    // UTF-8 bytes for "日本語"
    std::cout << (char)0xE6 << (char)0x97 << (char)0xA5 << (char)0xE6 << (char)0x9C << (char)0xAC << (char)0xE8 << (char)0xAA << (char)0x9E;
    return 0;
}

If I run the following in PowerShell:

[System.Console]::OutputEncoding = [System.Console]::InputEncoding = [System.Text.Encoding]::UTF8
.\print_it.exe # This is the above program ^
日本語 # This is the output as displayed in Powershell

Then 日本語 is printed and displayed correctly in PowerShell.

However if I add setlocale(LC_ALL, "English_United States.1252"); to the code, like this:

#include <clocale> // for setlocale

int main()
{
    setlocale(LC_ALL, "English_United States.1252");

    // UTF-8 bytes for "日本語"
    std::cout << (char)0xE6 << (char)0x97 << (char)0xA5 << (char)0xE6 << (char)0x9C << (char)0xAC << (char)0xE8 << (char)0xAA << (char)0x9E;
    return 0;
}

The program now prints garbage to PowerShell (æ—¥æœ¬èªž to be precise, which is the code page 1252 misinterpretation of those bytes).

BUT if I redirect the output to a file and then cat the file, it looks fine:

.\print_it.exe > out.txt
cat out.txt
日本語 # It displays fine, like this, if I redirect to a file and cat the file.

Also, Git Bash displays the output properly no matter what locale I pass to setlocale.

Could someone please help me understand why setlocale affects how the output is displayed in PowerShell, even though the same bytes are being written to stdout? It seems like PowerShell is somehow able to access the locale of the program and uses that to interpret the output?

PowerShell version is 5.1.17763.592.

Aurast

1 Answer


It is all about encoding. You get the correct characters with the > redirect because PowerShell decodes the program's output and > writes it to the file as UTF-16LE by default. So whatever encoding your program's output is in, it is automagically converted to UTF-16.

Depending on your PowerShell version, you may or may not be able to change the encoding of the redirect.

If you use Out-File with the -Encoding switch, you can change the encoding of the destination file (again, this depends on your PowerShell version).

I recommend reading mklement0's excellent Stack Overflow post on this topic.

Edit based on comment

Taken from cppreference

std::setlocale (C++ Localizations library), defined in header <clocale>:

char* setlocale( int category, const char* locale);

The setlocale function installs the specified system locale or its portion as the new C locale. The modifications remain in effect and influences the execution of all locale-sensitive C library functions until the next call to setlocale. If locale is a null pointer, setlocale queries the current C locale without modifying it.

The bytes you are sending to std::cout are the same, but the output path behind std::cout is locale-sensitive, so the locale you set takes precedence over your PowerShell UTF-8 settings. If you leave out the setlocale() call, std::cout obeys the shell encoding.

If you have PowerShell 5.1 or above, > is an alias for Out-File. You can set the encoding via $PSDefaultParameterValues, like this:

$PSDefaultParameterValues['Out-File:Encoding'] = 'UTF8'

Then you would get a UTF-8 file (with a BOM, which can be annoying!) instead of the default UTF-16LE.

Edit - adding some details as requested by OP

PowerShell uses the OEM code page by default, so you get whatever is configured in Windows. I recommend reading an excellent post on encoding on Windows. The point is that without the UTF8 setting in PowerShell, you are on your system's code page.

output.exe sets the locale to English_United States.1252 within the C++ program, and output_original.exe leaves the locale unchanged:

Here is the output without the UTF8 PowerShell setting:

c:\t>.\output.exe
æ-¥æo¬èªz  --> nonsense within the win1252 code page
c:\t>.\output.exe | hexdump
0000000 97e6 e6a5 ac9c aae8 009e --> both hex outputs are the same!
0000009
c:\t>.\output_original.exe
日本語  --> nonsense, but a different one! (depends on your locale setup; mine was English)
c:\t>.\output_original.exe | hexdump
0000000 97e6 e6a5 ac9c aae8 009e  --> both hex outputs are the same!
0000009

So what happens here? Your program produces output based on either the locale set in the program itself or the Windows one (which is OEM code page 1252 on my virtual machine). Notice that in both versions the hexdump is the same, but the displayed output (with the encoding applied) is not.

If you set your PowerShell to UTF8 with [System.Text.Encoding]::UTF8:

PS C:\t> [System.Console]::OutputEncoding = [System.Console]::InputEncoding = [System.Text.Encoding]::UTF8
PS C:\t> .\output.exe 
æ—¥æœ¬èªž  --> the English 1252 locale set within the program; notice that the output is similar to the one above (but the hexdump is different)
PS C:\t> .\output.exe | hexdump
0000000 bbef 3fbf 3f3f 0a0d  -> again hex dump is same for both so they are producing the same output!
0000008
PS C:\t> .\output_original.exe
日本語 --> correct output due to the fact you have forced the PowerShell encoding to UTF8, thus removing the output dependence on the OEM code (windows)
PS C:\t> .\output_original.exe | hexdump
0000000 bbef 3fbf 3f3f 0a0d -> again hex dump is same for both so they are producing the same output!
0000008

What happens here? If you force the locale in your C++ application, the std::cout output is interpreted with that locale (1252), and those characters are then converted to UTF-8 (that is why the first and second examples are a little bit different). When you do not force the locale in your C++ application, the PowerShell encoding is used, which is now UTF-8, and you get the correct output.

One thing I found interesting: if you change your Windows system locale to a Chinese-compatible one (PRC, Macao, Taiwan, Hong Kong, etc.), you will get some Chinese characters when not forcing UTF8, but different ones. That means those bytes only make sense as Unicode (UTF-8), so it works only there. If you force UTF8 in PowerShell, it works correctly even with a Chinese Windows system locale.

I hope this answers your question more fully.

Rant: It took me so long to investigate because my VS 2019 Community Edition license expired (WTF MS?) and I could not register it because the registration window was completely blank. Thanks MS, but no thanks.

tukan
  • I think I understand why outputting to a file and catting the file works, or at least that doesn't strike me as mysterious (actually I'm outputting UTF-8 and that's getting automagically converted, not 1252) but I don't understand why the same bytes written to stdout are getting rendered differently by Powershell depending on the locale set inside of the C++ code. – Aurast Nov 22 '19 at 00:47
  • @Aurast well, you are outputting UTF-8 with `[System.Text.Encoding]::UTF8`, but as said in my answer, the `>` converts it to UTF-16 (even when the base is UTF8). I see, so you are surprised that `setlocale()` works. I'll edit my answer. – tukan Nov 22 '19 at 08:02
  • I'm going to mark this as answer but I was hoping for more specifics. Would be grateful if you could update if you know more. Like: what exactly does that program give to the console that is different depending on the locale? Does it say "hey console, here are some bytes, and by the way my locale is set to 1252", and Powershell uses that locale information but git bash does not? I'm still unsure of what's going on behind the scenes that causes this difference in behavior between different shells. – Aurast Nov 23 '19 at 02:12
  • @Aurast I will add those details, but I need to get as close to your env as possible. Which compiler do you use for your C++ code? – tukan Nov 23 '19 at 10:49
  • @Aurast as you are using `stdafx.h` I bet you are using VS. I'll comment on that. – tukan Nov 25 '19 at 10:57
  • Thank you, yes I am using VS. Specifically: "Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x86" – Aurast Nov 25 '19 at 16:40
  • @Aurast see my added information. I have similar powershell (5.1.14409.1018) and I have investigated it on Win7 and VS 2019 – tukan Nov 27 '19 at 15:36