How to convert UTF8 char array to Windows 1252 char array

Question

I am a noob in C++, so I am sorry if this is a stupid question.

I have a piece of text: ÐŸÐ°Ð²Ð»Ð¾

I get it from console output in a piece of code I am working on. I know there is a Cyrillic word hidden behind it. Its real value is "Петро".

With an online encoding detector I found that, to read this text properly, I have to convert it from UTF-8 to Windows 1252.

How can I do it with code?

I have tried this. It gives some result, but it outputs 5 question marks (at least the length is as expected):

wchar_t *CodePageToUnicode(int codePage, const char *src)
{
    if (!src) return 0;
    int srcLen = strlen(src);
    if (!srcLen)
    {
        wchar_t *w = new wchar_t[1];
        w[0] = 0;
        return w;
    }

    int requiredSize = MultiByteToWideChar(codePage,
        0,
        src, srcLen, 0, 0);

    if (!requiredSize)
    {
        return 0;
    }

    wchar_t *w = new wchar_t[requiredSize + 1];
    w[requiredSize] = 0;

    int retval = MultiByteToWideChar(codePage,
        0,
        src, srcLen, w, requiredSize);
    if (!retval)
    {
        delete[] w;
        return 0;
    }

    return w;
}

char *UnicodeToCodePage(int codePage, const wchar_t *src)
{
    if (!src) return 0;
    int srcLen = wcslen(src);
    if (!srcLen)
    {
        char *x = new char[1];
        x[0] = '\0';
        return x;
    }

    int requiredSize = WideCharToMultiByte(codePage,
        0,
        src, srcLen, 0, 0, 0, 0);

    if (!requiredSize)
    {
        return 0;
    }

    char *x = new char[requiredSize + 1];
    x[requiredSize] = 0;

    int retval = WideCharToMultiByte(codePage,
        0,
        src, srcLen, x, requiredSize, 0, 0);
    if (!retval)
    {
        delete[] x;
        return 0;
    }

    return x;
}
int main()
{
    const char *text = "ÐŸÐ°Ð²Ð»Ð¾";

    // Now convert utf-8 back to ANSI:
    wchar_t *wText2 = CodePageToUnicode(65001, text);

    char *ansiText = UnicodeToCodePage(1252, wText2);
    cout << ansiText;
    _getch();

}

I also tried this, but it's not working properly:

int main()
{
    const char *orig = "ÐŸÐ°Ð²Ð»Ð¾";
    size_t origsize = strlen(orig) + 1;
    const size_t newsize = 100;
    size_t convertedChars = 0;
    wchar_t wcstring[newsize];
    mbstowcs_s(&convertedChars, wcstring, origsize, orig, _TRUNCATE);
    wcscat_s(wcstring, L" (wchar_t *)");

    std::wstring strUTF(wcstring);

    const wchar_t* szWCHAR = strUTF.c_str();

    cout << szWCHAR << '\n';


    char *buffer = new char[origsize / 2 + 1];

    WideCharToMultiByte(CP_ACP, 0, szWCHAR, -1, buffer, 256, NULL, NULL);

    cout << buffer;
    _getch();
}

Possible duplicate of [How to convert from UTF-8 to ANSI using standard c++](https://stackoverflow.com/questions/17562736/how-to-convert-from-utf-8-to-ansi-using-standard-c) — Daniel Waechter, Apr 06 '18 at 16:55
@DanielWaechter maybe but I am so bad that I can’t reuse that code — Vladyslav K, Apr 06 '18 at 16:57
You'll have to do something similar to that or find a third-party library. That's about as easy as encoding conversions get in standard C++. — user4581301, Apr 06 '18 at 16:59
The 1st snippet is wrong because it does not use WideCharToMultiByte() for the second conversion. The 2nd snippet is wrong because it uses mbstowcs() on a string that was already read with the wrong encoding. Pursue the 1st snippet. — Hans Passant, Apr 06 '18 at 17:06
`ÐŸÐ°Ð²Ð»Ð¾` is the UTF-8 encoded form of `Павло` (not `Петро`, which would be `ÐŸÐµÑ‚Ñ€Ð¾`) being misinterpreted as Windows-1252. The bytes are the same, so simply interpret them as UTF-8 instead. There is nothing to convert to make it UTF-8, it already is. If you want to convert it to UTF-16 for use in Win32 APIs, that is separate issue. — Remy Lebeau, Apr 06 '18 at 19:50
Also, you always have to tell your compiler the encoding (-source-charset) of your source file (as you must whenever you give a text file to any person or program). And, when you use literal strings (without an encoding prefix), what you tell your compiler to use as the execution character encoding (-execution-charset) will make a big difference. So, one would really wonder what you are doing when you put in a string like "ÐŸÐ°Ð²Ð»Ð¾" and treat it as if it were human text. (In this case, of course, you are running a test. But is the test failing because of the code or because of the compiler arguments?) — Tom Blodget, Apr 06 '18 at 23:40
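
The point made in the comments, that `ÐŸÐ°Ð²Ð»Ð¾` is just the UTF-8 form of `Павло` viewed through Windows-1252, can be checked by decoding the bytes by hand. The following is only a minimal sketch: `decode_utf8` and `check_pavlo` are hypothetical helper names, the decoder does not reject overlong forms or surrogates, and the byte values D0 9F D0 B0 D0 B2 D0 BB D0 BE are assumed to be what Windows-1252 stores for the characters Ð Ÿ Ð ° Ð ² Ð » Ð ¾.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Malformed or truncated sequences become U+FFFD, one byte at a time.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        bool ok = (i + extra < bytes.size());
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;        // not a continuation byte
            else cp = (cp << 6) | (b & 0x3Fu);         // append 6 payload bits
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Self-check: the bytes behind the mojibake decode to the letters of "Павло".
void check_pavlo()
{
    const std::vector<std::uint32_t> cps =
        decode_utf8("\xD0\x9F\xD0\xB0\xD0\xB2\xD0\xBB\xD0\xBE");
    const std::vector<std::uint32_t> expected =
        {0x041F, 0x0430, 0x0432, 0x043B, 0x043E};      // П а в л о
    assert(cps == expected);
}
```

The five code points U+041F, U+0430, U+0432, U+043B and U+043E spell Павло, so nothing has to be converted to obtain UTF-8; only the interpretation of the bytes changes.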

Victor Gubin · Answer 1 · 2018-04-06T17:24:17.370

score 4

There are a few options

Using the Windows API: convert your UTF-8 to the system's UTF-16LE with MultiByteToWideChar, then from UTF-16LE to CP1251 (Cyrillic is 1251, not 1252) with WideCharToMultiByte
Using the MS MLang API
Using the GNU iconv library
Using IBM ICU
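
As an illustration of the GNU iconv option from the list above, a conversion could look roughly like the sketch below. `convert_encoding` is a hypothetical helper name, not something from this answer. The code assumes the glibc-style `iconv` signature (a `char **` input pointer; some other implementations declare `const char **`) and that the installed iconv recognizes the encoding names "UTF-8" and "CP1251".

```cpp
#include <cassert>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a whole string between two encodings that
// iconv knows by name, e.g. from "UTF-8" to "CP1251".
std::string convert_encoding(const std::string& input,
                             const char* from_code,
                             const char* to_code)
{
    assert(from_code != nullptr && to_code != nullptr);

    iconv_t cd = iconv_open(to_code, from_code);  // note: target name comes first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported encoding pair");

    // Generous output buffer; UTF-8 to a single-byte code page never grows,
    // but other pairs may. This is an assumption, not a guaranteed bound.
    std::string output(input.size() * 4 + 4, '\0');

    char* in_ptr = const_cast<char*>(input.data());
    std::size_t in_left = input.size();
    char* out_ptr = &output[0];
    std::size_t out_left = output.size();

    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed: invalid or unconvertible input");

    output.resize(output.size() - out_left);      // keep only the bytes written
    return output;
}
```

Calling `convert_encoding(utf8_text, "UTF-8", "CP1251")` on the UTF-8 bytes of a Cyrillic word should return one byte per letter. On platforms where iconv lives in a separate library, linking with `-liconv` may be needed.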

If you simply need to output your Unicode text to the console, check this

edited Apr 06 '18 at 17:24

answered Apr 06 '18 at 16:58


score 2 · Answer 2 · answered Apr 06 '18 at 19:44

This is a printing issue. Your first function is correct; you can test it with MessageBoxW:

wchar_t *wbuf = CodePageToUnicode(CP_UTF8, "ÐŸÐ°Ð²Ð»Ð¾");
if(wbuf)
{
    MessageBoxW(0, wbuf, 0, 0);
    delete[] wbuf;
}

Output

"Павло" (not the same as what you said!)

You can print wide characters with std::wcout, or simplify the function to print using 1251 code page as follows:

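#include <iostream>
#include <string>
#include <Windows.h>

int main()
{
    // Assumption: the compiler stores this literal as the bytes
    // D0 9F D0 B0 D0 B2 D0 BB D0 BE, i.e. the UTF-8 form of "Павло"
    const char *buf = "ÐŸÐ°Ð²Ð»Ð¾";
    int size;

    // UTF-8 -> UTF-16; with -1 the returned size includes the terminating null
    size = MultiByteToWideChar(CP_UTF8, 0, buf, -1, 0, 0);
    std::wstring wstr(size, 0);
    MultiByteToWideChar(CP_UTF8, 0, buf, -1, &wstr[0], size);

    // UTF-16 -> Windows-1251 (Cyrillic)
    int codepage = 1251;
    size = WideCharToMultiByte(codepage, 0, &wstr[0], -1, 0, 0, 0, 0);
    std::string str(size, 0);
    WideCharToMultiByte(codepage, 0, &wstr[0], -1, &str[0], size, 0, 0);
    if (!str.empty()) str.pop_back();  // drop the terminating null counted in size

    // Make the console interpret the output bytes as Windows-1251
    SetConsoleOutputCP(codepage);
    std::cout << str << "\n";
    return 0;
}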

I actually figured that it works when my string is represented like this: L"\x043a\x043e\x0448\x043a\x0430". Do you know how to translate regular string to character codes? — Vladyslav K, Apr 06 '18 at 21:29
Use `L"Петро"` and UTF8 encoding in Visual Studio editor — Barmak Shemirani, Apr 06 '18 at 21:48
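
For the question above about turning a regular string into character codes, a rough sketch follows. `to_hex_escapes` is a hypothetical helper name; it assumes the text is already available as UTF-16 code units (which is what `wchar_t` holds on Windows) and simply prints each unit in the `\xNNNN` form quoted in the comment.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: renders each UTF-16 code unit as a \xNNNN escape.
// Characters outside the BMP appear as two escapes (a surrogate pair).
std::string to_hex_escapes(const std::u16string& text)
{
    std::string out;
    char buf[8];  // room for "\xNNNN" plus the terminating null
    for (char16_t unit : text) {
        const int n = std::snprintf(buf, sizeof buf, "\\x%04x",
                                    static_cast<unsigned>(unit));
        assert(n == 6);  // backslash, 'x', and four hex digits
        out += buf;
    }
    return out;
}
```

For example, `to_hex_escapes(u"\u043a\u043e\u0448\u043a\u0430")` yields the text `\x043a\x043e\x0448\x043a\x0430`. When pasting such escapes back into a literal, keep in mind that `\x` escapes consume every following hex digit, so `\uNNNN` universal character names are often the safer form.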
