
2nd Update: I found a very simple solution to this (actually not that hard) problem only one day after asking. But people seem to be small-minded, as there are already three close votes:

  1. Duplicate of "How to use unicode characters in Windows command line?" (1x):

    Obviously not, which has been clarified in the comments. This is not about the Windows command line tool, which I do not use.

  2. Unclear what you're asking (1x):

    Then you must suffer from functional illiteracy. I cannot be any more concrete when I ask, for example, "Is there an easy way to determine whether a char in a std::string is a non-ending part of a UTF-8 symbol?" (marked bold for better visibility, indeed) and state that this would be sufficient to answer the question (and even explain why). Seriously, there are even pictures to show the problem. Furthermore, my own existing answer should clarify it even more. Your own deficiencies are not sufficient grounds to declare something too hard to understand.

  3. Too broad (1x) ("Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer [...]"):

    This must be another issue of functional illiteracy. I stated clearly that a single way to solve the problem (which I have already found) is sufficient. You can identify an adequate answer as follows: take a look at my own accepted answer. Alternatively, use your brain to interpret my well-defined words if you are able to, which several people on this platform unfortunately seem not to be.

There is, however, an actual reason to close this question: it has already been solved. But there is no such option for a close vote. So, clearly, Stack Exchange supports the search for alternative solutions. Since I am a curious person, I am also interested in alternative ways to solve this. If your lack of intelligence does not cope well with understanding what the problem is and that it is quite relevant in certain environments (e.g. those that use Windows, C++ in Eclipse CDT, and UTF-8, but no Visual Studio and no Windows Console), then you can just leave without standing in the way of other people trying to satisfy their curiosity. Thanks!

1st Update: I used app.exe > out.txt 2>&1 which generates a file without these formatting issues. So the problem is that usually std::cout does this splitting but the underlying control (which receives the char sequence) has to handle correct reassembling? (Unfortunately nothing seems to handle it on Windows, except file streams. So I still need to circumvent this. Preferably without writing to files first and displaying their content -- which of course works.)

On the system that I use (Windows 7; MinGW-w64 (GCC 8.1 for Windows)), there is a bug with std::cout such that UTF-8-encoded strings are printed out before they are reassembled, even if they were disassembled internally by std::cout upon being passed a large string. The following code demonstrates how the bug seems to behave. Note, however, that the faulty displays appear to be random, i.e. the way std::cout slices up (equal) std::string objects is not the same for every execution of the program. But the problems appear consistently at indices that are multiples of 1024, which is how I inferred that behavior.

#include <iostream>
#include <sstream>

void myFaultyOutput();
void simulatedFaultyBehavior();

int main()
{
    myFaultyOutput();
    //simulatedFaultyBehavior();
}

void myFaultyOutput() {
    std::stringstream ss; // Note that ss is built correctly (which could be shown by saving ss.str() to a file).
    ss << "...";
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 341; j++)
            ss << u8"\u301A";
        ss << "\n..";
    }
    std::cout << ss.str() << std::endl; // Problem occurs here, with cout.
    // Note that converting ss.str() to UTF-16 std::wstring and using std::wcout results in std::wcout not
    // displaying anything, not even ASCII characters in the future (until restarting the application).
}

// Simulates the problem on well-behaved systems; imagine the output without the newlines, while the faultily formatted characters remain.
void simulatedFaultyBehavior() {
    std::stringstream ss;
    int amount = 2000;
    for (int j = 0; j < amount; j++)
        ss << u8"\u301A";
    std::string s = ss.str();
    std::cout << "s.length(): " << s.length() << std::endl; // amount * 3
    while (s.length() > 1024) {
        std::cout << s.substr(0, 1024) << std::endl;
        s = s.substr(1024);
    }
    std::cout << s << std::endl;
}
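As a standalone sanity check of the 1024-multiple observation (the function names below are mine, chosen for illustration): the stream consists almost entirely of 3-byte UTF-8 sequences, and since 1024 is not a multiple of 3, a flush every 1024 bytes is bound to land inside a character sooner or later. A minimal sketch:

```cpp
#include <cstddef>
#include <string>

// True iff the byte at index pos of s is a UTF-8 continuation byte
// (bit pattern 10xxxxxx), i.e. a cut at pos would split a character apart.
bool cutSplitsCharacter(const std::string& s, std::size_t pos) {
    return (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80;
}

// Rebuilds the exact byte stream produced by myFaultyOutput above.
std::string buildStream() {
    std::string s = "...";
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 341; j++)
            s += "\xE3\x80\x9A"; // the UTF-8 encoding of U+301A
        s += "\n..";
    }
    return s;
}

// Returns true iff some chunk boundary at a multiple of 1024 bytes
// falls inside one of the 3-byte sequences of the stream.
bool anyBoundaryCutsCharacter() {
    std::string s = buildStream();
    for (std::size_t pos = 1024; pos < s.length(); pos += 1024)
        if (cutSplitsCharacter(s, pos))
            return true;
    return false;
}
```

Indeed, already the first boundary at byte 1024 lands on the middle byte of a U+301A sequence, since the "..." prefix (3 bytes) plus 340 complete characters (1020 bytes) ends at byte 1022.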

To circumvent this behavior, I would like to split up large strings (which I receive as such from an API) manually into parts of fewer than 1024 chars (and then call std::cout separately on each of them). But I don't know which chars are actually just a non-ending part of a UTF-8 symbol, and the built-in Unicode converters also seem to be unreliable (possibly also system-dependent?). Is there an easy way to determine whether a char in a std::string is a non-ending part of a UTF-8 symbol? The following quote explains why answering this question would be sufficient.

A UTF-8 character can, for example, consist of three chars. So if one splits a string into two parts, those three chars should be kept together. Otherwise, one has to do what the existing GUI controls are clearly unable to do consistently, namely reassemble UTF-8 characters that have been split into pieces.
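In UTF-8 this check is a single bit test: every non-initial byte of a multi-byte character has the bit pattern 10xxxxxx, while no initial byte does. A minimal sketch of a boundary-respecting split based on that test (the helper names are hypothetical, not from any library):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// True iff c is a non-initial byte of a UTF-8 sequence (bit pattern 10xxxxxx).
bool isUtf8Continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Splits s into chunks of at most maxLen bytes, moving each cut backwards
// until it no longer lands inside a multi-byte character.
std::vector<std::string> splitUtf8(const std::string& s, std::size_t maxLen) {
    std::vector<std::string> chunks;
    std::size_t pos = 0;
    while (pos < s.length()) {
        std::size_t cut = std::min(pos + maxLen, s.length());
        while (cut > pos && cut < s.length() && isUtf8Continuation(s[cut]))
            --cut;
        if (cut == pos) // maxLen smaller than one character; cut anyway
            cut = std::min(pos + maxLen, s.length());
        chunks.push_back(s.substr(pos, cut - pos));
        pos = cut;
    }
    return chunks;
}
```

For instance, splitting two U+301A characters (6 bytes) with maxLen = 4 yields two 3-byte chunks instead of a 4-byte and a 2-byte one.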

Better ideas to circumvent the problem (others than "Don't use Windows" / "Don't use UTF-8" / "Don't use cout", of course) are also welcome.

Note that this question is unrelated to the Windows Console (I do not use it; things are displayed in Eclipse and optionally on wxWidgets UI elements, which display UTF-8 correctly). It is also unrelated to MSVC (I use the MinGW compiler, as I have mentioned). The code also mentions that using std::wcout with UTF-16 does not work at all (due to another MinGW and Eclipse bug). The bug results from UI controls being unable to handle what std::cout does (which may or may not be intentional). Furthermore, everything usually works fine, except for those UTF-8 symbols that were split up into different chars (e.g. \u301A into \u0003 + \u001A) at indices that are multiples of 1024 (and only randomly). This behavior already implies that most assumptions of commenters are false. Please consider the code -- especially its comments -- carefully rather than rushing to conclusions.

To clarify the display issue when calling myFaultyOutput():

(screenshot) in Eclipse CDT

(screenshot) in Scintilla (implemented in wxWidgets as wxStyledTextCtrl)

  • I don't see any code in the above to ensure that the Windows console is using UTF-8. By default `std::cout` is MBCS (your local code-page) and std::wcout is UTF-16. – Richard Critten Sep 24 '19 at 13:21
  • Since on most indices (and on all strings of lengths less than 1024) the formatting and displaying works fine, doesn't it obviously use UTF-8? It seems to be the default for my MinGW. Did you miss the information that I do not use MSVC? However, what would you suggest to ensure it? I can add that to the code. – xamid Sep 24 '19 at 13:33
  • What bug are you referring to? There's no string assembly involved. Did you configure your terminal to use UTF8? If not, the terminal will display individual bytes. – Panagiotis Kanavos Sep 24 '19 at 13:56
  • Possible duplicate of [How to use unicode characters in Windows command line?](https://stackoverflow.com/questions/388490/how-to-use-unicode-characters-in-windows-command-line) – Panagiotis Kanavos Sep 24 '19 at 13:57
  • Windows uses Unicode (UTF16) natively. The console understands UTF16 but anything else (including UTF8) is displayed using the configured codepage. You may only have to issue `chcp 65001` before starting your application – Panagiotis Kanavos Sep 24 '19 at 14:01
  • I do not use a terminal but Eclipse to display standard output. Sometimes I also redirect it to a wxStyledTextCtrl of wxWidgets, which displays the actual splits (which you can see from the numbers). It really does not matter here what the Windows Console does. – xamid Sep 24 '19 at 14:07
  • As for not using MSVC, `std` is the standard namespace. `wcout` is a standard stream. C++14 (if not 11) introduced UTF16 and UTF32 character types (char16_t, char32_t), strings (u16string, u32string) and streams. – Panagiotis Kanavos Sep 24 '19 at 14:07
  • @xamid update your question then and explain what you do, what the *actual* problem is. Widgets aren't `just` a redirection, they are UI elements. As I already explained, Windows is a Unicode OS. Those widgets are meant to display UTF16, not UTF8. You'll have to use the UTF16 types (u16string, char16_t etc) or convert your UTF8 strings to UTF16 before display – Panagiotis Kanavos Sep 24 '19 at 14:09
  • @PanagiotisKanavos I know all these things, but they are unrelated to the question. I think the question already clarifies this. The UI elements also have nothing to do with the issue. They display UTF-8 correctly. The bug is within std::cout from MinGW, as I have explicitly stated in the code. Also the code explains how your UTF-16 approach fails. – xamid Sep 24 '19 at 14:10
  • Can you reproduce the bug *without* Eclipse/wxWidgets? I've compiled & ran your code with MinGW GCC 9.2 (from MSYS2 packages), with `g++ 1.cpp && a.exe >1.txt`. The resulting file looks completely normal. Can you run this command and check if the result looks good? – HolyBlackCat Sep 24 '19 at 15:04
  • @HolyBlackCat That's a good call. I used `app.exe > out.txt 2>&1` which generates a file without these formatting issues. So the problem is that usually std::cout does this splitting but the underlying control (which receives the char sequence) has to handle correct reassembling? (Unfortunately nothing seems to handle it on Windows, except file streams. So I still need to circumvent this. Preferably without writing to files first and displaying their content -- which of course works.) – xamid Sep 24 '19 at 15:12
  • @xamid What do you call 'splitting' and 'reassembling'? If I had to guess, `std::cout` probably doesn't even understand that you give it UTF-8; it just outputs the exact bytes you give it (and maybe replaces `\n` with `\r\n`). – HolyBlackCat Sep 24 '19 at 15:15
  • @HolyBlackCat Just as I have described it in the question. What actually would be sufficient to answer this question, is, as mentioned: "**Is there an easy way to determine whether a char in a `std::string` is a non-ending part of an UTF-8 symbol?**" – xamid Sep 24 '19 at 15:27
  • *"described it in the question"* I've read the question again. No, you don't describe what 'splitting' and 'reassembling' is. *"easy way to determine whether a char ... is a non-ending part of an UTF-8 symbol"* It would be easier to check if a byte is the *first* byte of a symbol (by checking if the most significant bit is 0, see https://en.wikipedia.org/wiki/UTF-8 ) – HolyBlackCat Sep 24 '19 at 15:43
  • It just means what those words usually mean, but w.r.t. sequences of chars. An UTF-8 character can, for example, consist of three chars. So if one splits a string into two parts, it should keep those three characters together. Otherwise, one has to do what the existing GUI controls clearly are not able to do consistently. For instance, reassembling UTF-8-characters that have been split into pieces. – xamid Sep 24 '19 at 15:52
  • "I do not use a terminal but Eclipse to display standard output." Then you need to open a bug against Eclipse and wait, or switch to a better working IDE. – n. m. could be an AI Sep 24 '19 at 16:27
  • @n.m. Just for clarification, Eclipse works absolutely fine w.r.t. displaying the output stream. Either Windows or `std::cout` messed up the output stream (by splitting it up inadequately), as my [solution](https://stackoverflow.com/a/58099629/3410351) -- where some Windows API calls (to translate stdout to UTF-8) fix everything -- clarifies. Frankly, I have stated that the inadequate splitting is the problem from the very beginning in the question. Maybe you should not presuppose low capabilities of strangers but first evaluate your own. – xamid Sep 26 '19 at 15:33
  • Once again, I misunderstood your problem and answered a question you did not ask. I have retracted the answer. I am unable to reproduce your results. I cannot recommend a solution to a problem I don't see. – n. m. could be an AI Sep 26 '19 at 15:39
  • @n.m. I never asked you to recommend a solution or implied that you should. Also I clarified the system details (which you apparently did not reproduce, indeed). Note that different Windows versions may be relevant (i.e. Windows 10 might have fixed it already). I also clarified that stdout shall not be redirected into a file. – xamid Sep 26 '19 at 15:47
  • I do not have a slightest idea how to answer to your question as I now understand it. I cannot imagine what kind of phenomenon you are seeing. I cannot reproduce it with settings similar to yours, in particular with scite/scintilla. I didn't try eclipse because I don't have and can't install Java. I highly doubt it's a MinGW bug though. As far as I understand, MinGW does not assemble, disassemble, reassemble, or split UTF-8. It has no idea your stream contains UTF-8 data. It sends out bytes, and it doesn't care if it sends them to a file or to a console or to a pipe. – n. m. could be an AI Sep 26 '19 at 16:59
  • If you need any further help I'm afraid I will need access to your system, otherwise I don't really have anything to add. Good luck. – n. m. could be an AI Sep 26 '19 at 16:59

1 Answer


I worked out a fairly simple workaround by experimenting, which I am surprised nobody seems to have known about (I found nothing like it online).

N.m.'s attempted answer gave a good hint by mentioning the platform-specific function _setmode. What it does "by design" (according to this answer and this article) is set the file translation mode, i.e. the way the process's input and output streams are handled. But at the same time, it invalidates the use of std::ostream / std::istream and dictates the use of std::wostream / std::wistream for decently formatted input and output streams.

For instance, using _setmode(_fileno(stdout), _O_U8TEXT) means that std::wcout now works well for outputting std::wstring as UTF-8, but std::cout prints out garbage characters, even for ASCII arguments. However, I want to mainly use std::string, especially std::cout, for output. As I have mentioned, it is a rare case that the formatting for std::cout fails, so only in cases where I print out strings that may trigger this issue (potential multi-byte characters at indices of at least 1024) do I want to use a special output function, say coutUtf8String(string s).

The default (untranslated) mode of _setmode is _O_BINARY. We can temporarily switch modes. So why not just switch to _O_U8TEXT, convert the UTF-8-encoded std::string object to std::wstring, use std::wcout on it, and then switch back to _O_BINARY? To stay platform-independent, one can simply fall back to the usual std::cout call on platforms other than Windows. Here is the code:

#include <cstdint>
#include <iostream>
#include <string>

#if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__)
#include <fcntl.h> // For _O_U8TEXT and _O_BINARY.
#include <io.h>    // Non-standard POSIX compatibility layer; provides _setmode
                   // and _fileno on Windows NT.
#ifndef _O_U8TEXT // Some GCC distributions such as TDM-GCC 9.2.0 require this explicit
                  // definition since, depending on __MSVCRT_VERSION__, they might
                  // not define it.
#define _O_U8TEXT 0x40000
#endif
#endif

using namespace std;

wstring utf8toWide(const char* in); // defined below

void coutUtf8String(const string& s) {
#if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__)
    if (s.length() > 1024) {
        // Set translation mode of wcout to UTF-8; this renders cout unusable "by design"
        // (see https://developercommunity.visualstudio.com/t/_setmode_filenostdout-_O_U8TEXT;--/394790#T-N411680).
        if (_setmode(_fileno(stdout), _O_U8TEXT) != -1) {
            wcout << utf8toWide(s.c_str()) << flush; // We must flush before resetting the mode.
            // Set translation mode back to untranslated; this renders cout usable again.
            _setmode(_fileno(stdout), _O_BINARY);
        } else
            // Use wcout anyway: _setmode fails when no sink (such as Eclipse's
            // console window) is attached, and such sinks seem to be the cause
            // of wcout failing in default mode. The UI console view is filled
            // properly like this, regardless of translation modes.
            wcout << utf8toWide(s.c_str()) << flush;
    } else
        cout << s << flush;
#else
    cout << s << flush;
#endif
}

wstring utf8toWide(const char* in) {
    wstring out;
    if (in == nullptr)
        return out;
    uint32_t codepoint = 0;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f) // ASCII byte
            codepoint = ch;
        else if (ch <= 0xbf) // continuation byte: 10xxxxxx
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf) // leading byte of a 2-byte sequence: 110xxxxx
            codepoint = ch & 0x1f;
        else if (ch <= 0xef) // leading byte of a 3-byte sequence: 1110xxxx
            codepoint = ch & 0x0f;
        else // leading byte of a 4-byte sequence: 11110xxx
            codepoint = ch & 0x07;
        ++in;
        // Emit the code point once the next byte no longer continues the sequence.
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            if (codepoint > 0xffff) { // encode as a UTF-16 surrogate pair
                out.append(1, static_cast<wchar_t>(0xd7c0 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            } else if (codepoint < 0xd800 || codepoint >= 0xe000) // skip lone surrogates
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }
    return out;
}

This solution is especially convenient since it does not effectively abandon UTF-8, std::string or std::cout, which are mainly used for good reasons; it simply keeps using std::string itself and sustains platform independence. I rather agree with this answer that adding wchar_t (and all the redundant rubbish that comes with it, such as std::wstring, std::wstringstream, std::wostream, std::wistream, std::wstreambuf) to C++ was a mistake. Just because Microsoft makes bad design decisions, one should not adopt their mistakes but rather circumvent them.

Visual confirmation: (screenshot)

  • Microsoft does not write C++ standard or the library an implementation ships with. You have deep-seated issues with Windows. – Tanveer Badar Sep 25 '19 at 13:54
  • @TanveerBadar Exactly. That is why my statement "I rather agree [...] that adding wchar_t [...] to C++ was a mistake" was criticism towards the C++ committee, which adapted to Microsoft's mistakes instead of forcing them to behave decently. It doesn't, however, lessen my criticism towards Microsoft. – xamid Sep 25 '19 at 14:00