First of all, I just want to use Baltic characters in the console AND execute CMD commands that contain them, but the problems start with a default/standard C++ console application.

#include <iostream>
#include <string>
int main() {
    std::string output = "āāāčččēēēē";

    std::cout << output << std::endl;
}

Earlier, I asked this question on Stack Overflow: How to use UTF8 characters in DEFAULT c++ project OR when using mysql connector for c++ in visual studio 2019 (Latin7_general_ci to UTF-8)?

What I discovered in testing: if I convert a UTF-8 string to a Latin-1 string and then print it with cout (or print its hex values), some special characters are output in the console. For example:

**char s2[256] = "\xc3\xa9";** is printed as "ķ". THAT MEANS I need to convert strings into the correct hex values when needed, and some people might know how that can be done.
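
To see which byte values a string really holds before and after such a conversion, a small hex dump helps. The following is only a sketch; `toHex` is a hypothetical helper name, not something from the code in this question. For the example above, the UTF-8 bytes are `c3 a9`, and the Latin-1 conversion leaves the single byte `e9`; a Baltic console code page such as 775 would show that byte as "ķ" (that the console here uses such a code page is an assumption).

```cpp
#include <string>

// Hypothetical helper: returns the bytes of `bytes` as space-separated,
// two-digit lowercase hex values, e.g. "c3 a9".
std::string toHex(const std::string& bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) {
            out += ' ';
        }
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

Printing `toHex(s2)` before and after calling a conversion function shows whether the bytes changed as expected, independently of how the console renders them.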

BUT MY CODE LOGIC needs TO USE THIS STRING in a copy command run through CMD. After the conversion the CMD command fails, although the copy command that CMD has to execute seems to be shown correctly in the console.

// Example program
#include <iostream>
#include <string>
#include <fstream>
#include <sstream> 
#include <stdexcept>
#include <stdlib.h> 
#include <stdio.h> 
#include <time.h> 
#include <cstring> 
#include <cstdint>
#include <locale> 
#include <cstdlib>





int GetUtf8CharacterLength(unsigned char utf8Char)
{
    if (utf8Char < 0x80) return 1;
    else if ((utf8Char & 0x20) == 0) return 2;
    else if ((utf8Char & 0x10) == 0) return 3;
    else if ((utf8Char & 0x08) == 0) return 4;
    else if ((utf8Char & 0x04) == 0) return 5;

    return 6;
}

char Utf8ToLatin1Character(char* s, int* readIndex)
{
    int len = GetUtf8CharacterLength(static_cast<unsigned char>(s[*readIndex]));
    if (len == 1)
    {
        char c = s[*readIndex];
        (*readIndex)++;

        return c;
    }

    unsigned int v = (s[*readIndex] & (0xff >> (len + 1))) << ((len - 1) * 6);
    (*readIndex)++;
    for (len--; len > 0; len--)
    {
        v |= (static_cast<unsigned char>(s[*readIndex]) - 0x80) << ((len - 1) * 6);
        (*readIndex)++;
    }

    return (v > 0xff) ? 0 : (char)v;
}

// overwrites s in place
char* Utf8ToLatin1String(char* s)
{
    for (int readIndex = 0, writeIndex = 0; ; writeIndex++)
    {
        if (s[readIndex] == 0)
        {
            s[writeIndex] = 0;
            break;
        }

        char c = Utf8ToLatin1Character(s, &readIndex);
        if (c == 0)
        {
            c = '_';
        }

        s[writeIndex] = c;
    }

    return s;
}


int main()
{
    char s2[256] = "\xc3\xa9";
    Utf8ToLatin1String(s2);

    std::cout << s2 << std::endl;

    std::string locations2 = ("C:\\Users\\Janis\\Desktop\\TEST2\\");
    std::string txtt = (".txt");
    std::string copy2 = ("copy /-y ");

    std::string space = " ";
    std::string PACIENTI2 = "C:\\PACIENTI\\";




    std::string element = copy2 + locations2 + s2 + txtt;

    std::string cmd = element + space + PACIENTI2 + s2 + txtt;

    std::cout << cmd << std::endl;

    FILE* pipe = _popen(cmd.c_str(), "r");
    if (pipe != nullptr)
    {
        _pclose(pipe);  // wait for the command to finish and release the pipe
    }
}

So we really need to solve two problems: creating the hex string from the one already given, and making sure it works in CMD.

Ronalds Mazītis
  • The easy path for developing on MS-Windows is to always use UTF-16. This is the native Unicode standard on MS-Windows. – Richard Critten May 31 '21 at 12:14
  • The documentation you just gave me does not provide an actual example of a simple string conversion. We need a "HELLO WORLD" type example, where I actually don't have to go way beyond typing HELLO WORLD. – Ronalds Mazītis May 31 '21 at 12:39
  • See also https://stackoverflow.com/questions/3298569/difference-between-mbcs-and-utf-8-on-windows – Richard Critten May 31 '21 at 14:40

1 Answer


I've already given you a very good answer to your other question. Here is something similar.

Your program can use UTF-8 encoding while the console uses a different encoding, but you have to tell the standard library how each data source is encoded.
Of course, if the destination encoding does not support a specific character, some fallback has to kick in (see the example at the bottom).
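
The fallback idea can be illustrated without the standard library's locale machinery. The sketch below is an assumption-laden illustration, not what the runtime does internally: `utf8ToLatin1` is a hypothetical helper that converts UTF-8 to Latin-1 and writes a fallback character for anything Latin-1 cannot represent. As a simplification, it does not reject overlong UTF-8 sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 to Latin-1 (ISO-8859-1), writing
// `fallback` for code points above U+00FF and for malformed input.
std::string utf8ToLatin1(const std::string& utf8, char fallback = '?')
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 0;   // stays 0 for a byte that cannot start a sequence
        unsigned long cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        bool ok = (len != 0) && (i + len <= utf8.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
            if ((cont & 0xC0) != 0x80) {
                ok = false;                      // not a continuation byte
            } else {
                cp = (cp << 6) | (cont & 0x3F);  // append 6 payload bits
            }
        }
        if (!ok) {
            out += fallback;  // stray or truncated byte: skip just this byte
            ++i;
            continue;
        }
        out += (cp <= 0xFF) ? static_cast<char>(static_cast<unsigned char>(cp))
                            : fallback;
        i += len;
    }
    return out;
}
```

With this helper, "\xc3\xa9" (é) becomes the single byte 0xE9, while a character such as ā (U+0101), which Latin-1 does not contain, becomes `?`.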

There are 4 areas where encoding must be well defined to make everything work:

  • Your source code. VS uses the system locale to pick the encoding, and this is bad. Force VS and all editors to use a universal encoding; UTF-8 is the best choice. It is also best to tell the compiler how the source is encoded: cl /source-charset:utf-8 .....
  • Your executable. You have to define which encoding string literals should have in the final executable. Here UTF-8 is also the best: cl .... /execution-charset:utf-8 .....
  • When you run the application, you have to tell the standard library which encoding your string literals are in, or which encoding the program logic uses. So somewhere in your code, at the beginning of execution, you need something like this:
std::locale::global(std::locale{".utf-8"});
  • Finally, you have to tell each stream which encoding it should use. So for std::cout and std::cin you should set the locale that is the default for the system:
    auto streamLocale = std::locale{""};
    // this impacts date/time/floating-point formats, so you may want to tweak it to use only a specific encoding and the C locale for formatting
    std::cout.imbue(streamLocale);
    std::cin.imbue(streamLocale);

After this, everything should work as desired without code that explicitly does conversions.
Since there are 4 places to make a mistake, people have trouble with it, and the internet is full of "hack" solutions.
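
The code comment in the last bullet mentions tweaking the stream locale so that only the encoding comes from the system while formatting stays in the C locale. One possible way to do that with standard C++ is sketched below. `withClassicFormatting` and `formatNumber` are hypothetical helper names, and whether this split matches what a given standard library needs for console encoding is an assumption that should be checked on the target platform.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: keeps the character-related facets of `encodingLocale`
// but takes number, date/time and money formatting from the classic "C" locale.
std::locale withClassicFormatting(const std::locale& encodingLocale)
{
    const std::locale::category formatting =
        std::locale::numeric | std::locale::time | std::locale::monetary;
    return std::locale(encodingLocale, std::locale::classic(), formatting);
}

// Formats a number through a stream imbued with `loc`, to observe the effect.
std::string formatNumber(double value, const std::locale& loc)
{
    std::ostringstream out;
    out.imbue(loc);
    out << value;
    return out.str();
}
```

It would be used as `std::cout.imbue(withClassicFormatting(std::locale{""}));` in place of imbuing the system locale directly, so that numbers keep `.` as the decimal point regardless of the system's regional settings.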

Here is a test program to prove my point:

#include <iostream>
#include <locale>
#include <exception>
#include <string>

void setupLocale(int argc, const char *argv[])
{
    std::locale def{""};
    std::locale::global(argc > 1 ? std::locale{argv[1]} : def);
    auto streamLocale = argc > 2 ? std::locale{argv[2]} : def;
    std::cout.imbue(streamLocale);
    std::cin.imbue(streamLocale);
}

void printSeparator()
{
    std::cout << "---------\n";
}

void printTestStuff()
{
    std::cout << "Wester Europe: āāāčččēēēēßÞÖöñÅÃ\n";
    std::cout << "Central Europe: ąĄÓóŁłĘężćźŰűÝýĂă\n";
    std::cout << "China: 字集碼是把字符集中的字符编码为指定集合中某一对象\n";
    std::cout << "Korean: 줄여서 인코딩은 사용자가 입력한\n";
}

int main(int argc, const char *argv[]) {
    try{
        setupLocale(argc, argv);
        printSeparator();
        printTestStuff();
        printSeparator();
    }
    catch(const std::exception& e)
    {
        std::cerr << e.what() << '\n';
    }
}

And here is how it was built and run to show that it works (note this also covers scenarios where an invalid encoding is used):

C:\Users\User\Downloads>cl /source-charset:utf-8 /execution-charset:utf-8 /EHsc encodings.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.28.29336 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.

encodings.cpp
Microsoft (R) Incremental Linker Version 14.28.29336.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:encodings.exe
encodings.obj

C:\Users\User\Downloads>chcp
Active code page: 437

C:\Users\User\Downloads>encodings.exe
---------
Wester Europe: Ä?Ä?Ä?Ä?Ä?Ä?Ä"Ä"Ä"Ä"AYAzA-AA±A.Aƒ
Central Europe: Ä.Ä,A"A3Å?Å,Ä~ÄTżÄ╪źŰűA?A½Ä,ă
China: å--é>+碼æ~_æSSå--ç¬▌é>+ä,-çs,å--ç¬▌ç¼-ç ?ä,ºæO╪årsé>+å?^ä,-æY?ä,?å_1象
Korean: ì,ì-¬ì,o ì?,ì½"ë"cì?? ì,¬ìscìz?ê°? ìz.ë ¥ío
---------

C:\Users\User\Downloads>encodings.exe .65001
---------
Wester Europe: aaaccceeeeß_ÖöñÅA
Central Europe: aAOóLlEezczUuYyAa
China: ????????????????????????
Korean: ??? ???? ???? ???
---------

C:\Users\User\Downloads>encodings.exe .65001 .437
---------
Wester Europe: aaaccceeeeß_ÖöñÅA
Central Europe: aAOóLlEezczUuYyAa
China: ????????????????????????
Korean: ??? ???? ???? ???
---------

C:\Users\User\Downloads>encodings.exe .65001 .1250
---------
Wester Europe: aaaccceeeeß_ÖöñÅA
Central Europe: aAOóLlEezczUuYyAa
China: ????????????????????????
Korean: ??? ???? ???? ???
---------

C:\Users\User\Downloads>chcp 1250
Active code page: 1250

C:\Users\User\Downloads>encodings.exe .65001 .1250
---------
Wester Europe: aaačččeeeeß?ÖönAA
Central Europe: ąĄÓóŁłĘężćźŰűÝýĂă
China: ????????????????????????
Korean: ??? ???? ???? ???
---------

C:\Users\User\Downloads>chcp 65001
Active code page: 65001

C:\Users\User\Downloads>encodings.exe
---------
Wester Europe: ÄÄÄÄÄÄēēēēßÞÖöñÅÃ
Central Europe: ąĄÓóÅłĘężćźŰűÃýĂă
China: 字集碼是把字符集中的字符编ç ä¸ºæŒ‡å®šé›†åˆä¸­æŸä¸€å¯¹è±¡
Korean: 줄여서 ì¸ì½”ë”©ì€ ì‚¬ìš©ìžê°€ 입력한
---------

C:\Users\User\Downloads>encodings.exe .65001
---------
Wester Europe: āāāčččēēēēßÞÖöñÅÃ
Central Europe: ąĄÓóŁłĘężćźŰűÝýĂă
China: 字集碼是把字符集中的字符编码为指定集合中某一对象
Korean: 줄여서 인코딩은 사용자가 입력한
---------

C:\Users\User\Downloads>encodings.exe .65001 .65001
---------
Wester Europe: āāāčččēēēēßÞÖöñÅÃ
Central Europe: ąĄÓóŁłĘężćźŰűÝýĂă
China: 字集碼是把字符集中的字符编码为指定集合中某一对象
Korean: 줄여서 인코딩은 사용자가 입력한
---------

C:\Users\User\Downloads>
Marek R
  • If I set an encoding on `std::cout`, does that make `cout` re-encode the strings before printing? For example, if I `imbue` UTF-8 and my Windows console is in code page 437, does `cout` assume the input string is in UTF-8, and does it try to convert that input to code page 437? – Sourav Kannantha B Jan 19 '23 at 16:55
  • This is what this demo proves. Note that if conversion is impossible, `?` is used as a fallback. Note also that `ą` can be replaced with `a` as a fallback, and other characters are tweaked depending on the destination encoding. – Marek R Jan 19 '23 at 16:56
  • How does `cout` know the active code page? Can I also query that using standard C++? – Sourav Kannantha B Jan 19 '23 at 17:02
  • Everything is described in the answer. Take a look at all the bullet points. A Windows "code page" is the same as a "character encoding". – Marek R Jan 19 '23 at 17:04
  • Sorry, I was confused a bit. So `std::cout` always converts from `global` encoding to `imbue` encoding and `std::cin` does vice-versa. Is that right? – Sourav Kannantha B Jan 19 '23 at 17:09
  • Only if `stream::imbue` and `std::locale::global` are set. – Marek R Jan 19 '23 at 17:30