
For a little side project I need to output strings of text in Windows' CMD that may be localized, and some strings are read from the arguments of the program. To simplify matters I'll be using a simple echo program as a demonstration.

Please consider this snippet in C:

#include <stdio.h>

int main(int argc, char **argv) {
    // Display the first argument through the standard output:
    if (argc > 1)
        puts(argv[1]);
    return 0;
}

These are the outputs of two executions:

$ test.exe Wilhelm
$ Wilhelm

$ test.exe Röntgen
$ R÷ntgen

There you can already see that characters outside ASCII, such as ö, are not displayed correctly. They are correctly recognized in the program, though. For example, if you do something like:

if (argv[1][1] == 'ö')
    puts("It is.");

The sentence would be displayed, so the program is receiving the characters correctly.
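That comparison also depends on how the source file itself is encoded, so a more direct check is to dump the bytes of the argument. A minimal sketch follows; `hex_bytes` is a hypothetical helper, not part of the original program. On a system whose ANSI code page is 1252 one would expect ö to show up as F6, but that is an assumption about the system:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format the bytes of s as space-separated hex pairs
   into out (capacity out_size). Returns the number of bytes formatted. */
size_t hex_bytes(const char *s, char *out, size_t out_size) {
    size_t n = 0;
    size_t used = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (; s[n] != '\0'; n++) {
        /* Each byte needs at most a separator, two hex digits and the
           terminating NUL, so stop when fewer than 4 chars are left. */
        if (used + 4 > out_size)
            break;
        used += (size_t)snprintf(out + used, out_size - used, "%s%02X",
                                 n ? " " : "", (unsigned char)s[n]);
    }
    return n;
}
```

Calling `hex_bytes(argv[1], buf, sizeof buf)` and printing `buf` with `puts` only emits ASCII, so the result is readable whatever code page the console uses.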

So I thought, OK, maybe that wchar_t thing is needed. Making the appropriate changes and defining UNICODE and _UNICODE, you get:

#include <stdio.h>

int wmain(int argc, wchar_t **argv) {
    // Display the first argument through the standard output:
    if (argc > 1)
        _putws(argv[1]);
    return 0;
}

Still, the output of this test program is the same.

Looking around and reading the docs I found something of a workaround: setting the locale, to English for example, makes the text display correctly. Modifying the first version (without wchar_ts) I ended up with this:

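#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>

int main(int argc, char **argv) {
    // Copy the previous locale name (a later setlocale call may overwrite
    // the returned string), then change to English:
    char *current = setlocale(LC_ALL, NULL);
    char *old_locale = current ? _strdup(current) : NULL;
    setlocale(LC_ALL, "English");
    // Display the first argument through the standard output:
    if (argc > 1)
        puts(argv[1]);
    // Restore locale:
    if (old_locale) {
        setlocale(LC_ALL, old_locale);
        free(old_locale);
    }
    return 0;
}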

("en-US" doesn't seem to work in MinGW-w64 while "English" works with it and Microsoft Visual C++)

Now the program prints the character so that it is displayed correctly in the command line window.

The problem is that setting things to English is not the best thing to do in a Spanish system, or a Japanese one for example. So I thought about getting the locale from the system in some way. I found a function called _get_current_locale which returns a _locale_t, but it seems not to be what I wanted at all:

_locale_t_variable->locinfo->lc_category[LC_ALL].locale (which is a char *) seems to be NULL.
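For reference, standard C also lets a program ask for the environment's native locale by passing an empty string to setlocale. A minimal sketch follows, assuming (not verified here against CMD) that the C runtime maps this to the user's default locale; `use_native_locale` is a hypothetical helper name:

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: switch to the environment's native locale.
   Returns the name of the locale now in effect, never NULL. */
const char *use_native_locale(void) {
    /* "" requests the implementation-defined native environment. */
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL) {
        /* The request was rejected, so the previous locale is still in
           effect; query its name instead. */
        name = setlocale(LC_ALL, NULL);
    }
    return name != NULL ? name : "C";
}
```

Whether this gives the same result as "English" on an English system, or the right result on a Spanish or Japanese one, is exactly what would need testing.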

So the question is, how to get or display text in the locale of the command line? What would be the right way to deal with localized text in Windows' CMD (not necessarily in Unicode)?

James Russell
  • Your question has merit. The `echo` program can correctly echo `Röntgen` on my Win7 machine; so what you are trying to do is apparently possible. – Mahonri Moriancumer May 18 '14 at 22:16
  • But, then again, `echo` is internal to the MS cmd shell. It could have 'special' handling by the shell... – Mahonri Moriancumer May 18 '14 at 23:10
  • By default, the command prompt uses the OEM code page. Setting the C locale is irrelevant. You can, however, change this code page. – Cody Gray - on strike May 18 '14 at 23:50
  • Also see: [What encoding/code page is cmd.exe using](http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using) and [Can command prompt display unicode characters?](http://stackoverflow.com/questions/4879156/can-command-prompt-display-unicode-characters) – Cody Gray - on strike May 18 '14 at 23:52
  • SetConsoleCP() and/or SetConsoleOutputCP() with either CP_UTF8 or 65001 doesn't make the program output characters as they're entered in the program's arguments. My current code page is 437 (not Unicode) and can input/output those characters, I'll add that information in the question. – James Russell May 19 '14 at 00:40
  • @Mahonri Moriancumer Your assumption is wrong: what you observed is just because Win7 is not using strict ANSI; Win7 C uses WinANSI instead, which supports letters like ä ü ö ß etc. – dhein May 19 '14 at 13:19
  • @James Russell What you are trying is grounded on implementation-defined behavior, so it's not the best starting point for a program at all. Please correct me if I'm wrong. AFAIK the output you get is just internally correct because the binary codes match each other, but you can't rely on that, because it may vary from system to system. If you really would like to fix it you would have to go much deeper and write your own output formatting + character encoding. – dhein May 19 '14 at 13:22

1 Answer


"These are the two outputs...": If you are using cmd.exe, why does the prompt resemble a dollar-sign? Did you set it that way? If you really are using cmd.exe, you can check the "code page" with:

mode con cp /status
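The same kind of value can also be read from inside a program. A minimal sketch follows, assuming the console output code page returned by the Windows API function GetConsoleOutputCP is the one of interest here; `console_output_code_page` and `code_page_label` are hypothetical helpers, and the label table is deliberately tiny:

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: returns the console output code page, or 0 when it
   is unavailable (the API call failed, or this is not a Windows build). */
unsigned int console_output_code_page(void) {
#ifdef _WIN32
    return GetConsoleOutputCP();
#else
    return 0;
#endif
}

/* Hypothetical helper: short label for a few well-known code pages. */
const char *code_page_label(unsigned int cp) {
    switch (cp) {
    case 437:   return "OEM United States";
    case 850:   return "OEM Multilingual Latin 1";
    case 1252:  return "ANSI Latin 1 (Western European)";
    case 65001: return "UTF-8";
    default:    return "other";
    }
}
```

Printing `code_page_label(console_output_code_page())` at startup would show which of the scenarios described here a given run falls into.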

If you find that it's 437, that could explain your unexpected observation. Open up charmap.exe and you'll find that the character you're concerned about is called "U+00F6 Latin Small Letter O With Diaeresis". If you paste this into the CLI using code page 437, some interesting things happen...

The code that will be passed to a Unicode program will be: 0xF6, 0x00. Your program will receive this code.

The character is recognized as existing in code page 437, but with the code 0x94. I believe that the CLI (including the echo command) performs some WYSIWYG and this latter code (0x94) is displayed to you and output to stdout.

If you copy the character to the clipboard from the CLI, it will have gained an additional association with "OEM text" and the 0x94 code.

Now let's switch to code page 1252:

mode con cp select=1252

In this code page, when you paste from Character Map into the CLI, the code passed to a unicode program remains the same as in the previous scenario.

But now the character you observe is 0xF6 in the Terminal font (a font which visually resembles code page 437) and so you have the division-sign. The echo command will send this same code to stdout.
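To make the two byte values concrete: ö is 0xF6 in code page 1252 but 0x94 in code page 437, and 0xF6 in code page 437 is the division sign. A minimal sketch of that relationship follows; `cp1252_to_cp437` is a hypothetical helper covering only ASCII and a handful of characters, not a complete conversion table:

```c
/* Hypothetical helper: translate a code page 1252 byte to the code page
   437 byte for the same character. Only ASCII and a few characters are
   covered; anything else returns 0 (unmapped). */
unsigned char cp1252_to_cp437(unsigned char c) {
    if (c < 0x80)
        return c;            /* ASCII is identical in both code pages */

    switch (c) {
    case 0xE4: return 0x84;  /* ä */
    case 0xF6: return 0x94;  /* ö */
    case 0xFC: return 0x81;  /* ü */
    case 0xC4: return 0x8E;  /* Ä */
    case 0xD6: return 0x99;  /* Ö */
    case 0xDC: return 0x9A;  /* Ü */
    case 0xDF: return 0xE1;  /* ß */
    case 0xF7: return 0xF6;  /* ÷ (division sign) */
    default:   return 0;
    }
}
```

A byte 0xF6 sent to stdout untranslated is therefore drawn as ÷ by anything that interprets it as code page 437, which matches the "R÷ntgen" output in the question.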

If you copy the character to the clipboard from the CLI, it will have gained an additional association with "OEM text" and the 0x94 code, the same as before.

If you redirect the output of the echo command with this character to a file and open the file in Notepad using the Terminal font, you'll see the division-sign. If you change the font to Courier New, you'll see the "small o with diaeresis," as per Unicode.

Now switch back to code page 437:

mode con cp select=437

If you want a Windows Unicode program to output untranslated Unicode sequences to a FILE *, I believe you have to use binary mode. To modify your original code, you might have:

#define _UNICODE

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

#include <tchar.h>
#include <fcntl.h>
#include <io.h>

int __cdecl _tmain(int argc, TCHAR ** argv, TCHAR ** envp) {
    wchar_t bom = 0xFEFF;

    /* Nothing to echo without an argument */
    if (argc < 2)
        return EXIT_FAILURE;

    /* Binary mode: no character conversion or CR-LF translation */
    _setmode(_fileno(stdout), _O_BINARY);

    _ftprintf(stdout, _T("%c"), bom);
    _putts(argv[1]);

    return EXIT_SUCCESS;
  }

In this example, we write the UTF-16LE Byte Order Mark ("BOM") before writing the UTF-16 characters in the argument to stdout. This will look ugly in the CLI, but if you redirect to a file or work directly with a file (in binary mode), the results are probably more along the lines of what you were originally interested in:

#define _UNICODE

#ifdef _UNICODE
#define BOM { 0xFF, 0xFE, 0, 0 }
#else
#define BOM { 0 }
#endif

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

#include <tchar.h>
#include <fcntl.h>
#include <io.h>

int __cdecl _tmain(int argc, TCHAR ** argv, TCHAR ** envp) {
    /* Initialize the BOM string */
    static const union {
        unsigned char bytes[sizeof (TCHAR) * 2];
        TCHAR c[2];
      } bom = BOM;
    FILE * f;
    TCHAR filename[] = _T("testfile.txt");
    int r;
    int rc;

    /* Assume failure */
    rc = EXIT_FAILURE;

    if (argc != 2) {
        _ftprintf(stderr, _T("Usage: %s <word>\n"), argv[0]);
        goto err_usage;
      }

    f = _tfopen(filename, _T("wb"));
    if (!f) {
        _ftprintf(stderr, _T("Could not open file: %s\n"), filename);
        goto err_fopen;
      }

    r = _ftprintf(f, _T("%s"), bom.c);
    if (r != _tcsclen(bom.c)) {
        _ftprintf(stderr, _T("Could not write BOM to file\n"));
        goto err_bom;
      }

    r = _ftprintf(f, _T("%s"), argv[1]);
    if (r !=  _tcsclen(argv[1])) {
        _ftprintf(stderr, _T("Could not write argument to file\n"));
        goto err_arg;
      }

    rc = EXIT_SUCCESS;

    err_arg:

    err_bom:

    fclose(f);
    err_fopen:

    err_usage:

    return rc;
  }
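For comparison, the byte layout that ends up in the file can also be produced without tchar.h or wide-character printf. A minimal portable sketch follows, assuming the text is already available as UTF-16 code units; `write_utf16le` is a hypothetical helper, not a CRT function:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write a UTF-16LE BOM followed by n UTF-16 code
   units to f, low byte first. f should be opened in binary mode.
   Returns 0 on success, -1 on a write error. */
int write_utf16le(FILE *f, const unsigned short *units, size_t n) {
    size_t i;

    /* The BOM U+FEFF in little-endian byte order is FF FE. */
    if (fputc(0xFF, f) == EOF || fputc(0xFE, f) == EOF)
        return -1;

    for (i = 0; i < n; i++) {
        if (fputc(units[i] & 0xFF, f) == EOF ||
            fputc((units[i] >> 8) & 0xFF, f) == EOF)
            return -1;
    }
    return 0;
}
```

For "Rön" this writes the bytes FF FE 52 00 F6 00 6E 00, which is the 0xF6, 0x00 pair for ö described earlier, preceded by the BOM.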

Here are some additional resources which might help:

_tfopen: http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx

_ftprintf: http://msdn.microsoft.com/en-us/library/xkh07fe2.aspx

_setmode: http://msdn.microsoft.com/en-us/library/tw4k6df8.aspx

About Unicode with text and binary streams: http://msdn.microsoft.com/en-us/library/c4cy2b8e.aspx

SBCS, MBCS, Unicode functions: http://msdn.microsoft.com/en-us/library/tsbaswba.aspx

Shao