
I'm converting a wstring to a string with std::codecvt_utf8 as described in this question, but when I try Greek or Chinese characters, the symbols come out corrupted. I can see it in the debug Locals window; for example, 日本 became "æ—¥æœ¬".

std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv; //also tried codecvt_utf8_utf16
std::string str = myconv.to_bytes(wstr);

What am I doing wrong?

Random
  • Try to output the numerical values. Odds are, that it's just rendered incorrectly in the terminal or whatever output method you're using. After all, there is nothing that can reliably distinguish a byte string encoded in UTF-8 from one encoded in Latin-1. – Ulrich Eckhardt Jan 05 '22 at 18:27
  • If you are using Visual Studio the debugger assumes some other encoding for the `std::string` than UTF8, at least by default. (I forgot which one specifically.) `std::string` does not contain any information about encoding. If you use the string somewhere where it is interpreted as UTF8-encoded, the result will be correct. – user17732522 Jan 05 '22 at 18:36
  • @UlrichEckhardt so it also means I can't use std::transform on the str? – Random Jan 05 '22 at 18:43
  • @user17732522 yes, I'm using VS. So that means I also can't pass this str to something like std::transform? – Random Jan 05 '22 at 18:47
  • @Random Everything is correct with your `std::string`. It contains a UTF8 encoded string. But `std::string` itself doesn't know anything about that. It simply holds bytes. If you use `std::transform` on it, you will transform byte-by-byte in the UTF8 encoding, not unicode character-by-character. Depends on what you intend to do. – user17732522 Jan 05 '22 at 18:50
  • Related: https://stackoverflow.com/questions/402283/stdwstring-vs-stdstring – JosefZ Jan 05 '22 at 18:52
  • Have a look at : https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte – Pepijn Kramer Jan 05 '22 at 18:52
  • You can view a UTF-8-encoded string in the Watch window. Use `str,s8`. See https://learn.microsoft.com/en-us/visualstudio/debugger/format-specifiers-in-cpp?view=vs-2022 – Mark Tolonen Jan 05 '22 at 19:30

1 Answer


std::string simply holds an array of bytes. It does not hold information about the encoding in which these bytes are supposed to be interpreted, nor do the standard library functions or std::string member functions generally assume anything about the encoding. They handle the contents as just an array of bytes.

Therefore when the contents of a std::string need to be presented, the presenter needs to make some guess about the intended encoding of the string, if that information is not provided in some other way.
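
You can check the bytes yourself, independently of any presenter, by printing their numerical values (as suggested in the comments). A minimal sketch along the lines of the code in the question (codecvt_utf8 and wstring_convert are deprecated since C++17, but still available):

#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

int main() {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    std::string str = myconv.to_bytes(L"日本");

    // Print each byte in hex; for 日本 this prints e6 97 a5 e6 9c ac,
    // which is the correct UTF-8 encoding, no matter how a debugger
    // chooses to render the string.
    for (unsigned char c : str)
        std::printf("%02x ", static_cast<unsigned>(c));
    std::printf("\n");
}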

I am assuming that the encoding you intend to convert to is UTF8, given that you are using std::codecvt_utf8.

But if you are using Visual Studio, the debugger simply assumes one specific encoding, at least by default. That encoding is not UTF8; I suppose it is probably code page 1252.

As verification, Python gives the following:

>>> '日本'.encode('utf8').decode('cp1252')
'æ—¥æœ¬'

Your string does seem to be the UTF8 encoding of 日本 interpreted as if it were cp1252-encoded.

Therefore the conversion seems to have worked as intended.
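
You can also confirm in code that nothing was lost by converting the bytes back and comparing to the original wide string; a minimal round-trip sketch:

#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

int main() {
    std::wstring wstr = L"日本";
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;

    // Convert to UTF-8 and back; a successful round trip shows the
    // to_bytes result holds valid UTF-8 for the original characters,
    // regardless of how the debugger renders it.
    std::string str = myconv.to_bytes(wstr);
    std::wstring back = myconv.from_bytes(str);
    assert(back == wstr);
}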


As mentioned by @MarkTolonen in the comments, the encoding the Visual Studio debugger assumes for a string variable can be set to UTF8 with the s8 format specifier, as explained in the documentation.

user17732522