
I am new to C++, so I apologize if this is a basic question.

I have a piece of text: Павло

I get it from console output in a piece of code I am working on. I know that there is a Cyrillic word hidden behind it. Its real value is "Петро".

Using an online encoding detector, I found that to read this text properly I have to convert it from UTF-8 to Windows-1252.

How can I do it with code?

I have tried this. It gives some result, but it outputs 5 question marks (at least the length is as expected):

wchar_t *CodePageToUnicode(int codePage, const char *src)
{
    if (!src) return 0;
    int srcLen = strlen(src);
    if (!srcLen)
    {
        wchar_t *w = new wchar_t[1];
        w[0] = 0;
        return w;
    }

    int requiredSize = MultiByteToWideChar(codePage,
        0,
        src, srcLen, 0, 0);

    if (!requiredSize)
    {
        return 0;
    }

    wchar_t *w = new wchar_t[requiredSize + 1];
    w[requiredSize] = 0;

    int retval = MultiByteToWideChar(codePage,
        0,
        src, srcLen, w, requiredSize);
    if (!retval)
    {
        delete[] w;
        return 0;
    }

    return w;
}

char *UnicodeToCodePage(int codePage, const wchar_t *src)
{
    if (!src) return 0;
    int srcLen = wcslen(src);
    if (!srcLen)
    {
        char *x = new char[1];
        x[0] = '\0';
        return x;
    }

    int requiredSize = WideCharToMultiByte(codePage,
        0,
        src, srcLen, 0, 0, 0, 0);

    if (!requiredSize)
    {
        return 0;
    }

    char *x = new char[requiredSize + 1];
    x[requiredSize] = 0;

    int retval = WideCharToMultiByte(codePage,
        0,
        src, srcLen, x, requiredSize, 0, 0);
    if (!retval)
    {
        delete[] x;
        return 0;
    }

    return x;
}
int main()
{
    const char *text = "Павло";

    // Now convert utf-8 back to ANSI:
    wchar_t *wText2 = CodePageToUnicode(65001, text);

    char *ansiText = UnicodeToCodePage(1252, wText2);
    cout << ansiText;
    _getch();

}

I also tried this, but it does not work properly:

int main()
{
    const char *orig = "Павло";
    size_t origsize = strlen(orig) + 1;
    const size_t newsize = 100;
    size_t convertedChars = 0;
    wchar_t wcstring[newsize];
    mbstowcs_s(&convertedChars, wcstring, origsize, orig, _TRUNCATE);
    wcscat_s(wcstring, L" (wchar_t *)");

    std::wstring strUTF(wcstring);

    const wchar_t* szWCHAR = strUTF.c_str();

    cout << szWCHAR << '\n';


    char *buffer = new char[origsize / 2 + 1];

    WideCharToMultiByte(CP_ACP, 0, szWCHAR, -1, buffer, 256, NULL, NULL);

    cout << buffer;
    _getch();
}
Vladyslav K
  • Possible duplicate of [How to convert from UTF-8 to ANSI using standard c++](https://stackoverflow.com/questions/17562736/how-to-convert-from-utf-8-to-ansi-using-standard-c) – Daniel Waechter Apr 06 '18 at 16:55
  • @DanielWaechter maybe but I am so bad that I can’t reuse that code – Vladyslav K Apr 06 '18 at 16:57
  • You'll have to do something similar to that or find a third-party library. That's about as easy as encoding conversions get in standard C++. – user4581301 Apr 06 '18 at 16:59
  • The 1st snippet is wrong because it does not use WideCharToMultiByte() for the second conversion. The 2nd snippet is wrong because it uses mbstowcs() on a string that was already read with the wrong encoding. Pursue the 1st snippet. – Hans Passant Apr 06 '18 at 17:06
  • `Павло` is the UTF-8 encoded form of `Павло` (not `Петро`, which would be `Петро`) being misinterpreted as Windows-1252. The bytes are the same, so simply interpret them as UTF-8 instead. There is nothing to convert to make it UTF-8; it already is. If you want to convert it to UTF-16 for use in Win32 APIs, that is a separate issue. – Remy Lebeau Apr 06 '18 at 19:50
  • Also, you always have to tell your compiler the encoding (-source-charset) of your source file (as you do whenever you give a text file to any person or program). And, when you use literal strings (without an encoding prefix), what you tell your compiler to use as the execution character encoding (-execution-charset) will make a big difference. So, one would really wonder what you are doing when you put a string like "Павло" in the source and treat it like it's human text. (In this case, of course, you are running a test; but is the test failing because of the code or because of the compiler arguments?) – Tom Blodget Apr 06 '18 at 23:40
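
The point made in the comments, that the garbled text and the Cyrillic word are the same bytes read under two encodings, can be checked without any Windows API. The following is only a minimal sketch: the names `bytes_as_cp1252`, `encode_utf8` and `misread_as_cp1252` are made up for illustration, and mapping the five bytes that Windows-1252 leaves undefined to C1 control codes is an assumption, not something stated in this thread.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Code points that Windows-1252 assigns to bytes 0x80..0x9F.
// Assumption: the five undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
// are passed through as the C1 control code with the same value.
static const char32_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};

// Interpret every byte as one Windows-1252 character.
std::vector<char32_t> bytes_as_cp1252(const std::string &bytes)
{
    std::vector<char32_t> out;
    for (char ch : bytes)
    {
        unsigned char b = static_cast<unsigned char>(ch);
        if (b >= 0x80 && b <= 0x9F)
            out.push_back(kCp1252High[b - 0x80]);
        else
            out.push_back(b); // ASCII and 0xA0..0xFF equal their code point
    }
    return out;
}

// Encode code points below U+10000 as UTF-8 (enough for Windows-1252).
std::string encode_utf8(const std::vector<char32_t> &cps)
{
    std::string out;
    for (char32_t cp : cps)
    {
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Reproduce the garbling: read UTF-8 bytes as if they were Windows-1252.
std::string misread_as_cp1252(const std::string &utf8Bytes)
{
    return encode_utf8(bytes_as_cp1252(utf8Bytes));
}
```

Passing the UTF-8 bytes of `Павло` (`D0 9F D0 B0 D0 B2 D0 BB D0 BE`) to `misread_as_cp1252` yields the UTF-8 text for `Павло`, which is the string from the question.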

2 Answers


There are a few options

  1. Using Windows API

    Convert your UTF-8 to the system's UTF-16LE using MultiByteToWideChar, and then from UTF-16LE to CP1251 (Cyrillic is 1251, not 1252) using WideCharToMultiByte.

  2. Using the MS MLang API

  3. Using GNU ICONV library

  4. Using IBM ICU
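
To make the two steps of option 1 concrete, here is a portable, hand-rolled sketch of the same idea: decode UTF-8 to Unicode code points, then encode those as Windows-1251. It is not the Windows API and not a substitute for any of the libraries listed above. The function names are hypothetical, the decoder does not reject overlong sequences, and only ASCII, `А`..`я`, `Ё` and `ё` are mapped; anything else becomes `?`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Step 1: decode UTF-8 into Unicode code points (1- to 4-byte sequences).
std::vector<char32_t> decode_utf8(const std::string &in)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < in.size();)
    {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k)
        {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Step 2: encode code points as Windows-1251 (partial mapping, see above).
std::string encode_cp1251(const std::vector<char32_t> &cps)
{
    std::string out;
    for (char32_t cp : cps)
    {
        if (cp < 0x80)
            out += static_cast<char>(cp);
        else if (cp >= 0x0410 && cp <= 0x044F) // U+0410..U+044F -> 0xC0..0xFF
            out += static_cast<char>(cp - 0x0410 + 0xC0);
        else if (cp == 0x0401) // Ё
            out += static_cast<char>(0xA8);
        else if (cp == 0x0451) // ё
            out += static_cast<char>(0xB8);
        else
            out += '?'; // not representable in this sketch
    }
    return out;
}

std::string utf8_to_cp1251(const std::string &utf8)
{
    return encode_cp1251(decode_utf8(utf8));
}
```

For the bytes `D0 9F D0 B0 D0 B2 D0 BB D0 BE`, `utf8_to_cp1251` returns the five bytes `CF E0 E2 EB EE`. The `?` fallback also shows why a target code page without Cyrillic letters, such as 1252, can only produce question marks for this text.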

If you simply need to output your Unicode text to the console, check this

Victor Gubin

This is a printing issue. Your first function is correct; you can test it with MessageBoxW:

wchar_t *wbuf = CodePageToUnicode(CP_UTF8, "Павло");
if(wbuf)
{
    MessageBoxW(0, wbuf, 0, 0);
    delete[] wbuf;
}

Output

"Павло" (not the same as what you said!)

You can print wide characters with std::wcout, or simplify the function to print using 1251 code page as follows:

#include <iostream>
#include <string>
#include <Windows.h>

int main()
{
    const char *buf = "Павло";
    int size;

    // Passing -1 as the length makes each size include the terminating null.
    size = MultiByteToWideChar(CP_UTF8, 0, buf, -1, 0, 0);
    std::wstring wstr(size, 0);
    MultiByteToWideChar(CP_UTF8, 0, buf, -1, &wstr[0], size);

    int codepage = 1251;
    size = WideCharToMultiByte(codepage, 0, &wstr[0], -1, 0, 0, 0, 0);
    std::string str(size, 0);
    WideCharToMultiByte(codepage, 0, &wstr[0], -1, &str[0], size, 0, 0);
    if (!str.empty()) str.pop_back(); // drop the null so it is not printed

    SetConsoleOutputCP(codepage);
    std::cout << str << "\n";
    return 0;
}
Barmak Shemirani