My console window has a code page of 437, and I echoed Russian letters in the console window:

echo привет

And I got the correct Russian output, which is:

привет

But why do I get the correct Russian output? Shouldn't I get six question marks ("??????") instead? My reasoning is this: before the string "echo привет" is placed in the stdin buffer, it should be converted to code page 437, which would produce "??????" because those Russian letters don't exist in code page 437. cmd.exe would then retrieve the converted string from the stdin buffer and print "??????" to the console window.

I believe this is what should happen because I wrote a C program that sets the code page of the console window it is associated with to 437. When I send the Russian letters "привет" to the program, it prints "??????" to the console window. This is the code for my program:

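#include <Windows.h>
#include <stdio.h>

int main(void)
{
    /* Use code page 437 for both console output and console input. */
    SetConsoleOutputCP(437);
    SetConsoleCP(437);

    char str[1212];
    /* fgets replaces gets, which cannot bound the read and was removed in C11. */
    if (fgets(str, sizeof str, stdin) == NULL)
        return 1;

    /* Print through "%s" so the input is never interpreted as a format string. */
    printf("%s", str);

    return 0;
}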

I am using the classic console window (and not PowerShell), and I am using Windows 10.

user8240761
  • Why do you expect the console host to go through all the trouble of querying for the "current" code page, translating the (UTF-16) input into CP encoding, then doing its thing, and re-encoding the (presumed) CP-encoded output back to UTF-16? I haven't looked, but I strongly doubt that the de-and-re-encoding is happening. That out of the way, what's the problem you need to solve? – IInspectable Feb 27 '23 at 14:25
  • @IInspectable I expect the console window to do that because it is doing that for my program, so why wouldn't it do that for cmd.exe also? I mean can the console window know that it is talking to cmd.exe in the first place? That is, if the console window knows that it is talking to cmd.exe then yes it can decide not to go through all the trouble you mentioned, but I don't think the console window knows who it is talking to. – user8240761 Feb 27 '23 at 16:20
  • Why are you using 8 bit text with code page 437 and hoping to represent Russian text? – David Heffernan Feb 27 '23 at 16:26
  • @David Heffernan I am just doing this to understand how this stuff works, I am not working on a real project. – user8240761 Feb 27 '23 at 16:31
  • OK, so why would you expect this to work? Do you have an actual goal? – David Heffernan Feb 27 '23 at 16:38
  • @David Heffernan What I expected is to see the same behavior that happened with my program to also happen with cmd.exe, and since this isn't the case, I am just wondering why is that. – user8240761 Feb 27 '23 at 16:47
  • Your program has opted in to quirks mode, codepage encoding. A character encoding that's not self-sufficient, and folks that care about text don't use it. If you wish to keep your sanity, call [`_getws`](https://learn.microsoft.com/en-us/cpp/c-runtime-library/gets-getws) and [`wprintf`](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/printf-printf-l-wprintf-wprintf-l) instead, and drop the calls that set the console codepage. – IInspectable Feb 27 '23 at 18:58
  • @IInspectable I would use Unicode functions if I'm working on a real project, but I'm not, I'm just trying to understand how the Windows console works. – user8240761 Feb 28 '23 at 08:33
  • This seems like an utterly pointless quest. Why would you want to learn about something that is useless? Why is codepage 437 interesting to you? – David Heffernan Feb 28 '23 at 15:33
  • If you are destined to learn how the "classic" console host works, you can look it up. Its source code is [public](https://github.com/microsoft/terminal). – IInspectable Feb 28 '23 at 22:46
  • @David Heffernan I don't think that my question is useless, and I don't care about codepage 437, I just want to understand why cmd.exe and my program show different behaviors. – user8240761 Mar 02 '23 at 04:54
  • What version of Windows 10 are you using? – tukan Mar 02 '23 at 08:20
  • @tukan I'm using version 22H2. – user8240761 Mar 02 '23 at 08:29
  • If you are trying to understand cmd.exe, why are you using 437? What makes you think that cmd.exe is doing that? – David Heffernan Mar 02 '23 at 10:23

1 Answer

Edited following the discussion with the OP below.

I only have access to Windows 10, version 1903. Since you have a newer version, I think the same should apply.

CMD interpreter

From what I know, cmd.exe has excellent Unicode support, so it does not matter which code page is active. Your premise that the typed text goes through a code page conversion before cmd.exe receives it is incorrect. ECHO is an internal cmd.exe command, so it handles Unicode as well.

When you type привет at the cmd.exe prompt and use ECHO, a Unicode program (cmd.exe) is running a Unicode internal command (ECHO), so the string is read and printed as Unicode even with code page 437 active. cmd.exe is a client process attached to the console, and it is built as a Unicode application. As far as I know, it talks to the console through the wide-character API, so the input code page is never applied to what you type. That is why echo привет correctly prints привет.

Your C program

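Your C program reads its input through a byte-oriented function (char strings), so the console has to translate the typed characters into the input code page, which you set with SetConsoleCP(437), before handing them to your process. Code page 437 has no Cyrillic characters, so each letter of привет is replaced with ?, which produces your ??????.

To quote the MSDN documentation for the SetConsoleCP function: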

Sets the input code page used by the console associated with the calling process. A console uses its input code page to translate keyboard input into the corresponding character value.

BONUS

For further reading on cmd.exe and Unicode, see the excellent answers on SO to these questions (I don't want to duplicate the information):

Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10)

How to use unicode characters in Windows command line?

tukan
  • *"Your premise that cmd.exe is a console is wrong"* I didn't say that I think that cmd.exe is a console, when I used the term "console" in my question, I was talking about a separate program that cmd.exe communicate with. – user8240761 Mar 02 '23 at 09:01
  • Your C application is the one that is deforming the string. You are using non-Unicode `char` instead of `wchar_t`. The console is not deforming it, because it is Unicode, as I have written above. – tukan Mar 02 '23 at 09:08
  • I don't think that you understood my question (also I know that `echo` is an internal command, I forgot to mention that in my first comment). The most important part of my question is the following: when you type russian letters in the console window, wouldn't those russian letters be converted first to the character encoding of the input code page of the console (which is set using `SetConsoleCP()`) and then the converted string will be placed in the stdin buffer, and then the process that is associated with the console window will retrieve this converted string from the stdin buffer? – user8240761 Mar 02 '23 at 09:31
  • @user8240761 Apparently, I'm still fuzzy on your question. Is it this: if you run your C application (`some.exe`, built from the above source code) in the `cmd.exe` interpreter with `SetConsoleCP()` set to *437*, why do you see the correct Cyrillic instead of the incorrect `??????`? – tukan Mar 02 '23 at 11:25
  • My question is: why does cmd.exe apparently receive the Russian letters "привет" correctly when I type the string "echo привет" in the console, while my program receives the same Russian letters ("привет") as question marks when I type them in the console to send them to it? It makes sense that my program receives "привет" as question marks: the code page of the console is 437, and since the Russian letters don't exist in code page 437, they are replaced by question marks. But why didn't cmd.exe also receive "привет" as question marks? – user8240761 Mar 02 '23 at 12:09
  • @user8240761 I think I understand now. Your program receives the question marks because you have specified `SetConsoleCP(437)`, which *Sets the input code page used by the console associated with the calling process.* That setting is not ignored for your program: its input abides by the code page (there is probably a case in which `cmd.exe` abides by it too). On the other hand, the command in `cmd.exe` is Unicode and ignores your code page, so it prints "привет" as Unicode characters instead of question marks. That is the reason why you are getting different results. – tukan Mar 02 '23 at 13:24
  • Yes exactly, this is what I mean. – user8240761 Mar 02 '23 at 13:27
  • @user8240761 Great. I'll edit the answer to reflect our discussion – tukan Mar 02 '23 at 13:45