C++ Output UTF-8 strings as UTF-16 to std::cout

Question

I have a lot of code written based on UTF-8 using C++03, STL and Boost 1.54.
All the code outputs data to the console via std::cout or std::cerr.
I do not want to introduce a new library to my code base or switch to C++11, but I want to port the code to Windows.
Rewriting all the code to use std::wcout or std::wcerr instead of std::cout or std::cerr is not what I want, but I still want everything displayed on the console as UTF-16.
Is there a way to change std::cout and std::cerr to transform all char based data (which is UTF-8 encoded) to wchar_t based data (which would be UTF-16 encoded) before writing it to the console?
It would be great to see a solution for this using just C++03, STL and Boost 1.54.
I found that Boost Locale has conversion functions for single strings and there is a UTF-8 to UTF-32 iterator in Boost Spirit available but I could not find any facet codecvt to transform UTF-8 to UTF-16 without using an additional library or switching to C++11.

Thanks in advance.

PS: I know it is doable with something like this, but I hope to find a better solution here.

Stdin/out are stuck with 8-bit encodings due to I/O redirection. Switch the console to utf-8 instead, call SetConsoleCP(CP_UTF8) in your main() function. No rewrite of your code required. Update the font as well, the default Terminal font is no longer appropriate, Consolas or Lucida Sans are the usual choice. — Hans Passant, Oct 07 '14 at 19:50
I am sure there is some way to work around the 8-bit encoding problem. I am aware that you can set the output console to UTF-8 on Windows, but this is not my aim. I want to get a proper UTF-16 output for Windows. — user2525536, Oct 08 '14 at 15:31
You won't get UTF-16 output on the console though, because everything written to the console is interpreted somehow - unless you use the Console API directly and don't go through `cout`. — Mark Ransom, Oct 24 '14 at 18:38

user2525536 · Answer 1 · 2015-10-31T09:39:13.020

I did not come up with a better solution than the one already hinted at.
So I will just share the solution based on streambuf here for anyone who is interested in it. Hopefully, someone will come up with a better solution and share it here.

#include <cstdlib>
#include <cstdio>
#include <cstring> /* memmove */
#include <cwchar>  /* wprintf, fwprintf */
#include <iostream>
#include <string>


#if defined(_WIN32) && defined(_UNICODE) && (defined(__MSVCRT__) ||defined(_MSC_VER))
#define TEST_ARG_TYPE wchar_t
#else /* not windows, unicode */
#define TEST_ARG_TYPE char
#endif /* windows, unicode */


#ifndef _O_U16TEXT
#define _O_U16TEXT 0x20000
#endif


static size_t countValidUtf8Bytes(const unsigned char * buf, const size_t size) {
    size_t i, charSize;
    const unsigned char * src = buf;
    for (i = 0; i < size && (*src) != 0; i += charSize, src += charSize) {
        charSize = 0;
        if ((*src) >= 0xFC) {
            charSize = 6;
        } else if ((*src) >= 0xF8) {
            charSize = 5;
        } else if ((*src) >= 0xF0) {
            charSize = 4;
        } else if ((*src) >= 0xE0) {
            charSize = 3;
        } else if ((*src) >= 0xC0) {
            charSize = 2;
        } else if ((*src) >= 0x80) {
            /* Skip stray UTF-8 continuation bytes (should never happen). */
            while ((i + charSize) < size && src[charSize] != 0 && src[charSize] >= 0x80) {
                charSize++;
            }
        } else {
            /* ASCII character. */
            charSize = 1;
        }
        if ((i + charSize) > size) break;
    }
    return i;
}


#if defined(_WIN32) && defined(_UNICODE) && (defined(__MSVCRT__) ||defined(_MSC_VER))
#include <locale>
#include <streambuf>
#include <boost/locale.hpp>

extern "C" {
#include <fcntl.h>
#include <io.h>
#include <windows.h>

int _CRT_glob;
extern void __wgetmainargs(int *, wchar_t ***, wchar_t ***, int, int *);
}


class Utf8ToUtf16Buffer : public std::basic_streambuf< char, std::char_traits<char> > {
private:
    char * outBuf;
    FILE * outFd;
public:
    static const size_t BUFFER_SIZE = 1024;
    typedef std::char_traits<char> traits_type;
    typedef traits_type::int_type int_type;
    typedef traits_type::pos_type pos_type;
    typedef traits_type::off_type off_type;

    explicit Utf8ToUtf16Buffer(FILE * o) : outBuf(new char[BUFFER_SIZE]), outFd(o) {
        /* Initialize the put pointer. Overflow won't get called until this
         * buffer is filled up, so we need to use valid pointers.
         */
        this->setp(outBuf, outBuf + BUFFER_SIZE - 1);
    }

    ~Utf8ToUtf16Buffer() {
        delete[] outBuf;
    }
protected:
    virtual int_type overflow(int_type c);
    virtual int_type sync();
};


Utf8ToUtf16Buffer::int_type Utf8ToUtf16Buffer::overflow(Utf8ToUtf16Buffer::int_type c) {
    char * iBegin = this->outBuf;
    char * iEnd = this->pptr();
    int_type result = traits_type::not_eof(c);

    /* If c is not eof, append it to the buffer.
     * This is why the end pointer passed to setp is off by 1
     * (to reserve room for this character).
     */
    if ( ! traits_type::eq_int_type(c, traits_type::eof()) ) {
        *iEnd = traits_type::to_char_type(c);
        iEnd++;
    }

    /* Calculate output data length. */
    int_type iLen = static_cast<int_type>(iEnd - iBegin);
    int_type iLenU8 = static_cast<int_type>(
        countValidUtf8Bytes(reinterpret_cast<const unsigned char *>(iBegin), static_cast<size_t>(iLen))
    );

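    /* Convert string to UTF-16 and write to defined file descriptor.
     * The text is passed as an argument, not as the format string, so that
     * '%' characters in the output are printed literally.
     */
    if (fwprintf(this->outFd, L"%ls", boost::locale::conv::utf_to_utf<wchar_t>(std::string(outBuf, outBuf + iLenU8)).c_str()) < 0) {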
        /* Failed to write data to output file descriptor. */
        result = traits_type::eof();
    }

    /* Reset the put pointers to indicate that the buffer is free.
     * The end pointer stays one short of the buffer's end, as in the
     * constructor, so the next overflow() can still append its character.
     */
    if (iLenU8 == iLen) {
        this->setp(outBuf, outBuf + BUFFER_SIZE - 1);
    } else {
        /* Move incomplete UTF-8 characters remaining in buffer. */
        const size_t overhead = static_cast<size_t>(iLen - iLenU8);
        memmove(outBuf, outBuf + iLenU8, overhead);
        this->setp(outBuf + overhead, outBuf + BUFFER_SIZE - 1);
    }

    return result;
}


Utf8ToUtf16Buffer::int_type Utf8ToUtf16Buffer::sync() {
    return traits_type::eq_int_type(this->overflow(traits_type::eof()), traits_type::eof()) ? -1 : 0;
}

#endif /* windows, unicode */


int test_main(int argc, TEST_ARG_TYPE ** argv);


#if defined(_WIN32) && defined(_UNICODE) && (defined(__MSVCRT__) ||defined(_MSC_VER))
int main(/*int argc, char ** argv*/) {
    wchar_t ** wenpv, ** wargv;
    int wargc, si = 0;
    /* this also creates the global variable __wargv */
    __wgetmainargs(&wargc, &wargv, &wenpv, _CRT_glob, &si);
    /* enable UTF-16 output to standard output console */
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::locale::global(boost::locale::generator().generate("UTF-8"));
    Utf8ToUtf16Buffer u8cout(stdout);
    std::streambuf * out = std::cout.rdbuf();
    std::cout.rdbuf(&u8cout);
    /* process user defined main function */
    const int result = test_main(wargc, wargv);
    /* flush pending output, then revert stream buffers to let cout clean up remaining memory correctly */
    std::cout.flush();
    std::cout.rdbuf(out);
    return result;
#else /* not windows or unicode */
int main(int argc, char ** argv) {
    return test_main(argc, argv);
#endif /* windows, unicode */
}

int test_main(int /*argc*/, TEST_ARG_TYPE ** /*argv*/) {
    const std::string str("\x61\x62\x63\xC3\xA4\xC3\xB6\xC3\xBC\xE3\x81\x82\xE3\x81\x88\xE3\x81\x84\xE3\x82\xA2\xE3\x82\xA8\xE3\x82\xA4\xE4\xBA\x9C\xE6\xB1\x9F\xE6\x84\x8F");

    for (size_t i = 1; i <= str.size(); i++) {
        const std::string part(str.begin(), str.begin() + i);
        const size_t validByteCount = countValidUtf8Bytes(reinterpret_cast<const unsigned char *>(part.c_str()), part.size());
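        /* %u expects unsigned int; size_t may be wider, so cast explicitly. */
        wprintf(L"i = %u, v = %u\n", static_cast<unsigned int>(i), static_cast<unsigned int>(validByteCount));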
        const std::string valid(str.begin(), str.begin() + validByteCount);
        std::cout << valid << std::endl;
        std::cout.flush();
        for (size_t j = 0; j < part.size(); j++) {
            wprintf(L"%02X", static_cast<int>(part[j]) & 0xFF);
        }
        wprintf(L"\n");
    }

    return EXIT_SUCCESS;
}
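
The key step in overflow() above is working out how many buffered bytes form complete UTF-8 characters, so that a sequence split across two flushes is held back until the rest arrives. As a minimal, portable sketch of that idea (not the code from this answer), the hypothetical helper completeUtf8PrefixLength below inspects only the tail of the buffer. It assumes well-formed UTF-8 with at most four bytes per character and does no validation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, named here for illustration only.
// Returns how many leading bytes of buf[0..size) can be converted
// without cutting the final multi-byte UTF-8 sequence in half.
// Assumption: input is well-formed UTF-8, at most 4 bytes per character.
static std::size_t completeUtf8PrefixLength(const unsigned char * buf, std::size_t size) {
    assert(buf != NULL || size == 0);

    // Walk back over at most 3 trailing continuation bytes (10xxxxxx).
    std::size_t back = 0;
    while (back < size && back < 3 && (buf[size - 1 - back] & 0xC0) == 0x80) {
        back++;
    }
    if (back == size) {
        // Empty input, or nothing but continuation bytes: hold nothing back.
        return size;
    }

    // The byte in front of the continuation bytes announces the length.
    const unsigned char lead = buf[size - 1 - back];
    std::size_t need = 1;
    if (lead >= 0xF0) {
        need = 4;
    } else if (lead >= 0xE0) {
        need = 3;
    } else if (lead >= 0xC0) {
        need = 2;
    }

    // The last sequence has back + 1 bytes so far; exclude it if it is short.
    return (back + 1 < need) ? (size - 1 - back) : size;
}
```

A stream buffer could convert the first completeUtf8PrefixLength(...) bytes and move the remainder to the front of its buffer, much like the memmove in overflow(). Scanning from the end keeps the cost constant per flush, whereas a front-to-back scan such as countValidUtf8Bytes touches every byte.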

Brandon · Answer 2 · 2014-10-07T20:25:58.127

I feel like this is probably a bad idea, but I think it should still be seen, as it does work provided that the console has the right font.

#include <cwchar>
#include <iostream>
#include <string>
#include <windows.h>
//#include <io.h>
//#include <fcntl.h>

std::wstring UTF8ToUTF16(const char* utf8)
{
    std::wstring utf16;
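    // With cbMultiByte == -1 the returned length includes the terminating NUL.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len > 1)
    {
        utf16.resize(len);
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &utf16[0], len);
        utf16.resize(len - 1); // drop the NUL so size() counts only the text
    }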
    return utf16;
}

std::ostream& operator << (std::ostream& os, const char* data)
{
    //_setmode(_fileno(stdout), _O_U16TEXT);
    // WriteConsoleW takes UTF-16 directly, so no code page change is needed.
    std::wstring str = UTF8ToUTF16(data);
    DWORD slen = static_cast<DWORD>(str.size());
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), str.c_str(), slen, &slen, NULL);
    return os;
}

std::ostream& operator << (std::ostream& os, const std::string& data)
{
    //_setmode(_fileno(stdout), _O_U16TEXT);
    // c_str() is safe for empty strings and guarantees NUL termination.
    std::wstring str = UTF8ToUTF16(data.c_str());
    DWORD slen = static_cast<DWORD>(str.size());
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), str.c_str(), slen, &slen, NULL);
    return os;
}

std::wostream& operator <<(std::wostream& os, const wchar_t* data)
{
    DWORD slen = static_cast<DWORD>(wcslen(data));
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), data, slen, &slen, NULL);
    return os;
}

std::wostream& operator <<(std::wostream& os, const std::wstring& data)
{
    // NULL instead of nullptr keeps the code valid C++03.
    DWORD slen = static_cast<DWORD>(data.size());
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), slen, &slen, NULL);
    return os;
}

int main()
{
    std::cout<<"Россия";
}

Now std::cout and std::wcout both use the WriteConsoleW function. You would have to overload it for const char*, char*, std::string, char, and whatever else you need. Maybe template it.
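
UTF8ToUTF16 above delegates the conversion to MultiByteToWideChar. For readers who want to see what that step involves, here is a rough C++03 sketch that needs no Windows API. The name utf8ToUtf16Units is hypothetical. It returns code units as unsigned short rather than wchar_t, and it assumes well-formed input: overlong forms, stray continuation bytes and invalid code points are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical portable stand-in for the conversion step.
// Assumption: 'in' holds well-formed UTF-8; no validation is performed.
static std::vector<unsigned short> utf8ToUtf16Units(const std::string & in) {
    std::vector<unsigned short> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        unsigned long cp;  // code point being assembled
        std::size_t len;   // sequence length announced by the lead byte
        if (b < 0x80)      { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }

        // Sketch only: a truncated sequence is treated as a precondition violation.
        assert(i + len <= in.size());

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; k++) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;

        if (cp < 0x10000UL) {
            // Basic Multilingual Plane: one UTF-16 code unit.
            out.push_back(static_cast<unsigned short>(cp));
        } else {
            // Supplementary planes: encode as a surrogate pair.
            cp -= 0x10000UL;
            out.push_back(static_cast<unsigned short>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<unsigned short>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting units could be copied into a std::wstring and handed to WriteConsoleW. For production use, a validating converter such as MultiByteToWideChar or boost::locale::conv::utf_to_utf is the safer choice.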

So, now you have a handful of ambiguous overloads for `operator<<`. What do. I think the linked approach with a custom streambuf would already be superior. And more performant as it allows buffering individual insertions. — sehe, Oct 07 '14 at 21:29
I agree with sehe. This solution does not look better than the streambuf approach. I was looking for a little bit better way as you need to handle not just the mentioned but a far wider range of overloads, too... — user2525536, Oct 08 '14 at 15:39
