5

I'm using Microsoft Windows 10 with mingw-w64 (gcc version 8.1.0, x86_64-posix-sjlj-rev0, Built by MinGW-W64 project) in cmd. When I try to print a Spanish character on the Windows console, or to read one and print it back, the output is wrong. For example, I tried to run this program:

#include <stdio.h>

int main(void) {
    char c[20];
    printf("pía\n");
    scanf("%s", c);
    printf("%s", c);
}

If I enter some Spanish characters, the echoed input is correct, but the string printed at the beginning is displayed wrongly:

pía
laíóñaú
laíóñaú

Some answers suggest calling setlocale(), but the results are the same. Another suggestion is to enable the UTF-8 Unicode compatibility option in the region settings:

[screenshot of the region settings]

But now the error is the opposite: the string printed at the beginning is correct, but when I enter a non-ASCII character the console doesn't show it:

pía
lía
l

It is a bit frustrating, since every solution I have seen relies on the setting above or on calling setlocale(), but none of them works for me and I don't know why.

EDIT

As Mofi suggested in the comments, I tried using SetConsoleCP() and SetConsoleOutputCP() to change the code page of the console. Without fully understanding how these functions work, and with the same code as above, I ran several combinations, all with wrong results:

pía                       | p├¡a                    | p├¡a                  | pía
lía                       | lía                     | lía                   | lía
l                         | l                       | lía                   | la
input: 65001 output 65001 | input: 65001 output 850 | input: 850 output 850 | input: 850 output 65001

Since I don't fully understand these functions, I don't know why in the last example the console doesn't show the accented character that was stored but does show the one in the printed literal, while in the example before it the opposite happens.

chqrlie
  • If I'm not mistaken the MSYS terminal in MinGW does not support UTF-8. You may want to switch to Linux which uses UTF-8 consistently. – August Karlstrom Jan 10 '21 at 11:59
  • I recommend reading the Microsoft documentation [Console Developer's Guide & API Reference](https://learn.microsoft.com/en-us/windows/console/console-reference) and using the Windows console functions in your C application. The current [code page](https://en.wikipedia.org/wiki/Code_page) used by cmd.exe, which depends on the region (country) configured for the account in use, can be retrieved with the function [GetConsoleCP](https://learn.microsoft.com/en-us/windows/console/getconsolecp) and set with [SetConsoleCP](https://learn.microsoft.com/en-us/windows/console/setconsolecp). – Mofi Jan 10 '21 at 16:53
  • `GetConsoleCP` and `SetConsoleCP` are for the standard input stream. The code page for the standard output stream can be retrieved with [GetConsoleOutputCP](https://docs.microsoft.com/en-us/windows/console/getconsoleoutputcp) and set with [SetConsoleOutputCP](https://docs.microsoft.com/en-us/windows/console/setconsoleoutputcp). Usually the same code page is used for both streams. I recommend using code page 854 or [850](https://en.wikipedia.org/wiki/Code_page_850) for Spanish. Which code page is the default for your account on your machine, as displayed when running `chcp` in a cmd window? – Mofi Jan 10 '21 at 16:54
  • See also the Microsoft documentation page [Code Page Identifiers](https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers). I suggest looking at [Using another language (code page) in a batch file made for others](https://stackoverflow.com/a/48982681/3074564) and the comments written by [eryksun](https://stackoverflow.com/users/205580/eryksun). For maximum compatibility with older Windows versions like Windows 7/XP, it would be better not to use UTF-8 and instead to use an OEM code page that contains all the Spanish characters your Windows console application needs to support. – Mofi Jan 10 '21 at 17:06
  • If you just want to solve the issue with the output of `pía`, which is most likely encoded as the byte stream `70 ED 61 0A 00` (hexadecimal) using code page [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252) by your text editor __and__ your C compiler, I suggest using `printf("\x70\xA1\x61\n"); /* That is pía OEM 850 encoded. */` so that the string is compiled to the byte stream `70 A1 61 0A 00` according to [OEM code page 850](https://en.wikipedia.org/wiki/Code_page_850), independent of the code page used by the text editor and nearly independent of the character encoding the C compiler uses. – Mofi Jan 13 '21 at 07:37
  • @Mofi Your last comment could be a solution, but I think there should be a better, more practical one. – Rafael Hernández Marrero Jan 13 '21 at 08:05
  • First you have to learn about [character encoding](https://en.wikipedia.org/wiki/Character_encoding). See also the introductory chapters on [this page](https://www.ultraedit.com/support/tutorials-power-tips/ultraedit/unicode.html). You have to know which byte stream your editor stores in the .c source file for a string you type. Next you have to know how the C compiler interprets the bytes of a string in a .c file when it creates the byte stream of the string in the object file, which is finally linked into an executable or library. – Mofi Jan 13 '21 at 11:44
  • Then you have to know or control which character encoding is used by the console, so that the bytes of the string in the executable are interpreted at run time with the same character encoding the C compiler used to create them. And last but not least, the font used for text display in the console window must support those characters too. So a console application using Unicode should preferably also define the console font, and that font must be installed on the user's machine in a version that supports the characters. There are a lot of requirements that must be fulfilled for non-ASCII output. – Mofi Jan 13 '21 at 11:49
  • On Linux it is quite simple: text editors use UTF-8 by default, C compilers by default pass the UTF-8 encoded byte stream of a source file through as is when a simple `char` array is used, Linux terminals use UTF-8 as their default character encoding, and the default font of terminal windows is designed for the console, with extensive support for the characters in the Basic Multilingual Plane of Unicode. Using Unicode in a Windows console requires a lot of extra code, because the console is designed around a region-dependent OEM code page for backward compatibility with MS-DOS from the 1980s. – Mofi Jan 13 '21 at 11:55
  • @Mofi Thanks for all the information you gave me. I know that my knowledge of character encoding is poor, because I didn't think it possible that the encoding used by the text editor could influence how the characters are interpreted in C and in the command window. – Rafael Hernández Marrero Jan 14 '21 at 11:04

2 Answers

1

I played around with this for a while and the only thing that worked was using _setmode() to set the stdin and stdout to take in wide characters, and then working with wchar_t instead of char to store the text. This code works as intended on my machine:

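#include <stdio.h>
#include <wchar.h>
#include <fcntl.h>
#include <io.h>

int main(void) {
    /* Switch both standard streams to wide-character (Unicode) mode. */
    _setmode(_fileno(stdin), _O_WTEXT);
    _setmode(_fileno(stdout), _O_WTEXT);
    wchar_t c[20];
    wprintf(L"pía\n");
    wscanf(L"%19ls", c);   /* the width keeps the input within c */
    wprintf(L"%ls", c);
    return 0;
}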

EDITED: I changed the parameter of _setmode from _O_U16TEXT to _O_WTEXT to avoid implementation issues resulting from how the length of wchar_t is either 2 or 4 bytes based on the compiler.

directquest
  • This could be a solution, like the one @Mofi gave in the last comment, but, as I mentioned above, there should be a simpler solution to apply, shouldn't there? – Rafael Hernández Marrero Jan 13 '21 at 08:09
  • Well the problem is that `scanf` doesn't take in Unicode streams as per the documentation, so you have to use `wscanf` and set the standard I/O modes to accept wide characters. You could replace `wchar_t` with `char`, but I think that would be poor practice as one `char` wouldn't correspond to one actual Unicode character, and you would need to double the size of the array. @RafaHernández – directquest Jan 13 '21 at 13:30
  • At least this way you don't have to mess with codepages and locales and whatnot. – directquest Jan 13 '21 at 13:30
0

As Mofi said in the comments above, the solution lies in how the editor I was using encodes the characters that I type. I was using Visual Studio Code; to change its default encoding, click the encoding shown in the lower right corner and change UTF-8 to CP 850. The editor then saves Spanish characters with the same bytes that the console's code page 850 expects.

The next problem is changing the code page of the console. With the command chcp 850, or with the functions SetConsoleCP(850) and SetConsoleOutputCP(850), we can change the code page of the console that is currently open. To apply this by default to every new console, do the following:

  • Open the Registry Editor and go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Command Processor.
  • Choose New -> String Value and name it Autorun.
  • Set its value to chcp 850 > nul.