2nd Update: I found a very simple solution to this actually not that hard problem, only one day after asking. But people seem to be small-minded so there are three close votes already:
Duplicate of "How to use unicode characters in Windows command line?" (1x):
Obviously not, which has been clarified in the comments. This is not about the Windows command line tool, which I do not use.
Unclear what you're asking (1x):
Then you must suffer from functional analphabetism. I cannot be any more concrete when I ask, for example "Is there an easy way to determine whether a char in a std::string is a non-ending part of an UTF-8 symbol?" (marked bold for better visibility, indeed) and state that this would be sufficient to answer the question (and even explain why). Seriously, there are even pictures to show the problem. Furthermore, my own existing answer should clarify it even more. Your own deficiencies are not sufficient to declare something as too hard to understand.
Too broad (1x) ("Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer [...]"):
This must be another issue with functional analphabetism. I stated clearly that a single way to solve the problem (which I have already found) is sufficient. You can identify an adequate answer as follows: Take a look at the accepted answer of my own. Alternatively, use your brain to interprete my well-defined words if you are able to, which several people on this plattform unfortunately seem not.
There is, however, an actual reason to close this question: It has already been solved. But there is no such option for a close vote. So, cleary, Stack Exchange supports that there may be found alternative solutions. Since I am a curious person, I am also interested in alternative ways to solve this. If your lack of intelligence does not cope well with understanding what the problem is and that it is quite relevant under certain environments (e.g. such that use Windows, C++ in Eclipse CDT, UTF-8, but no Visual Studio and no Windows Console), then you can just leave without standing in the way of other people to satisfy their curiosity. Thanks!
1st Update: I used
app.exe > out.txt 2>&1
which generates a file without these formatting issues. So the problem is that usually std::cout does this splitting but the underlying control (which receives the char sequence) has to handle correct reassembling? (Unfortunately nothing seems to handle it on Windows, except file streams. So I still need to circumvent this. Preferably without writing to files first and displaying their content -- which of course works.)
On the system that I use (Windows 7; MinGW-w64 (GCC 8.1 for Windows)), there is a bug with std::cout
so that UTF-8 encoded strings are printed out before they are reassembled, even if they were disassembled internally by std::cout
by passing a large string. The following code explains how the bug seems to behave. Note that, however, the faulty displays appear to be random, i.e. the way std::cout
slices up (equal) std::string
objects is not equivalent for every execution of the program. But the problems appear consistently at indices which are multiples of 1024, which is how I concluded that behavior.
#include <iostream>
#include <sstream>
void myFaultyOutput();
void simulatedFaultyBehavior();
int main()
{
myFaultyOutput();
//simulatedFaultyBehavior();
}
void myFaultyOutput() {
std::stringstream ss; // Note that ss is built correctly (which could be shown by saving ss.str() to a file).
ss << "...";
for (int i = 0; i < 20; i++) {
for (int j = 0; j < 341; j++)
ss << u8"\u301A";
ss << "\n..";
}
std::cout << ss.str() << std::endl; // Problem occurs here, with cout.
// Note that converting ss.str() to UTF-16 std::wstring and using std::wcout results in std::wcout not
// displaying anything, not even ASCII characters in the future (until restarting the application).
}
// To display the problem on well-behaved systems ; just imagine the output would not contain newlines, while the faulty formatted characters remain.
void simulatedFaultyBehavior() {
std::stringstream ss;
int amount = 2000;
for (int j = 0; j < amount; j++)
ss << u8"\u301A";
std::string s = ss.str();
std::cout << "s.length(): " << s.length() << std::endl; // amount * 3
while (s.length() > 1024) {
std::cout << s.substr(0, 1024) << std::endl;
s = s.substr(1024);
}
std::cout << s << std::endl;
}
To circumvent this behavior, I would like to split up large strings (which I receive as such from an API) manually in parts of lengths less than 1024 chars (and then call std::cout separately on each of them). But I don't know which chars actually are just a non-ending part of an UTF-8 symbol and the built-in Unicode converters also seem to be unreliable (possibly also system-dependent?). Is there an easy way to determine whether a char in a std::string
is a non-ending part of an UTF-8 symbol? The following quote explains why answering this question would be sufficient.
An UTF-8 character can, for example, consist of three chars. So if one splits a string into two parts, it should keep those three characters together. Otherwise, one has to do what the existing GUI controls clearly are not able to do consistently. For instance, reassembling UTF-8-characters that have been split into pieces.
Better ideas to circumvent the problem (others than "Don't use Windows" / "Don't use UTF-8" / "Don't use cout", of course) are also welcome.
Note that this question is unrelated to the Windows Console (I do not use it -- things are displayed in Eclise and optionally on wxWidgets UI elements, which display UTF-8 correctly). It is also unrelated to MSVC (I use the MinGW compiler, as I have mentioned). In the code is also mentioned that using std::wcout with UTF-16 does not work at all (due to another MinGW an Eclipse bug). The bug results from UI controls being unable to handle what std::cout
does (which may be intentional or not). Furthermore, everything usually works fine, except for those UTF-8 symbols that were split up into different chars (e.g. \u301A into \u0003 + \u001A) at indices which are multiples of 1024 (and only randomly). This behavior implies already that most assumptions of commenters are false. Please consider the code -- especially its comments -- carefully rather than rushing to conclusions.
To clarify the display issue when calling myFaultyOutput()
:
- In Eclipse CDT:
- In Scintilla (implemented in wxWidgets as wxStyledTextCtrl):