I represent folder paths with boost::filesystem::path, which stores a wstring on Windows, and I would like to convert a path to std::string with the following code:

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv1;
shared_dir = conv1.to_bytes(temp.wstring());

Unfortunately, the accented text comes out like this:

"c:\git\myproject\bin\árvíztűrőtükörfúrógép" -> "c:\git\myproject\bin\árvíztűrÅ‘tükörfúrógép"

What am I doing wrong?

#include <string>
#include <locale>
#include <codecvt>

int main()
{
    // wide character data
    std::wstring wstr =  L"árvíztűrőtükörfúrógép";

    // wide (UTF-16 on Windows) to UTF-8; same facet as in the snippet above
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv1;
    std::string str = conv1.to_bytes(wstr);
}

I was checking the value of the variable in visual studio debug mode.

Lightness Races in Orbit
Csuszmusz
  • Why would you store UTF-8 in a `std::wstring`? – eerorika Sep 18 '19 at 16:44
  • Are you using C++17? If so, consider `std::filesystem` instead. `path` has a `generic_u8string()` method that might be useful. Btw, how are you printing the converted string? Are you using `std::wcout`? You should probably not mix `std::cout` and `std::wcout` too much. – Ted Lyngmo Sep 18 '19 at 16:48
  • Not convinced by the duplicate. It may not consider the OP's specific encoding needs. Instead I'd like to see a [mcve] and a full explanation of the circumstances of the bug. – Lightness Races in Orbit Sep 18 '19 at 17:29
  • Also how do you witness this result? Are you sure you're not just misinterpreting `shared_dir` as ASCII? – Lightness Races in Orbit Sep 18 '19 at 17:32
  • @eerorika I suspect the title is just wrong/misleading though ofc we cannot be sure right now – Lightness Races in Orbit Sep 18 '19 at 17:32
  • "*but unfortunatelly the result of the following text is this*" - `árvíztűrÅ‘tükörfúrógép` is the UTF-8 encoded form of `árvíztűrőtükörfúrógép` being *misinterpreted* as ANSI instead of UTF-8. The code is fine, the data is correct, it is just the *display* of the UTF-8 data that is faulty. – Remy Lebeau Sep 18 '19 at 19:06
  • @TedLyngmo Unfortunately I am only using C++14. I am using the VS Locals window to check the contents of the variables. – Csuszmusz Sep 19 '19 at 07:38
  • @LightnessRacesinOrbit Minimal example is added to the post. Also, how could I misinterpret shared_dir as ASCII? – Csuszmusz Sep 19 '19 at 07:44
  • @Csuszmusz There are many such ways. Bad terminal settings, for example. There's still no minimal example because you do not show us how you are witnessing the behaviour. What do you _do_ with `shared_dir` at the end of this program? Where do you look at it? In what? With what settings? Your example does not even output the string! – Lightness Races in Orbit Sep 19 '19 at 12:06
  • Part1 @LightnessRacesinOrbit Thank you for your answer and the solution ideas that you are implying with it. So basically I was checking the value of the variable in visual studio debug mode. I also tried to convert the variable back with conv1.from_bytes and I got back the right result. – Csuszmusz Sep 19 '19 at 15:36
  • Part2 @LightnessRacesinOrbit Also, I tried the code snippet that I inserted above in a sandbox project and I see "árvíztűrőtükörfúrógép" correctly, so the problem should be project specific. This led me to the following question: [possible_root_cause](https://stackoverflow.com/questions/58013410/stdlocale-throws-runtime-error-exception-to-en-us-utf-8-locale) – Csuszmusz Sep 19 '19 at 15:36
  • Okay, then I have posted an answer on that basis. – Lightness Races in Orbit Sep 19 '19 at 15:47
  • Check out my answer to [C++ Visual Studio character encoding issues](https://stackoverflow.com/a/40337240/3258851) and also [this one](https://stackoverflow.com/a/49567787/3258851). Also, since you are using Boost, consider `boost::locale::conv::utf_to_utf(wstr);`, since the `<codecvt>` header is deprecated in C++17. – Marc.2377 Sep 19 '19 at 15:50
  • @Marc.2377 I don't believe either of those answers are relevant. – Lightness Races in Orbit Sep 19 '19 at 16:00
  • @LightnessRacesinOrbit Having now seen your answer, I agree, you're right. I think I'll leave the comment, though. – Marc.2377 Sep 19 '19 at 16:16

1 Answer

The code is fine.

You're taking a wstring that stores UTF-16 encoded data, and creating a string that stores UTF-8 encoded data.

> I was checking the value of the variable in visual studio debug mode.

Visual Studio's debugger has no idea that your string stores UTF-8. A string just contains bytes. Only you (and people reading your documentation!) know that you put UTF-8 data inside it. You could have put something else inside it.

So, in the absence of anything more sensible to do, the debugger just renders the string as ASCII*. What you're seeing is the ASCII* representation of the bytes in your string.

Nothing is wrong here.

If you were to output the string like `std::cout << str`, and if you were running the program in a command line window set to UTF-8, you'd get your expected result. Furthermore, if you inspect the individual bytes in your string, you'll see that they are encoded correctly and hold your desired values.

You can push the IDE to decode the string as UTF-8, though, on an as-needed basis: in the Watch window type `str,s8`; or, in the Command window, type `? &str[0],s8`. These techniques are explored by Giovanni Dicanio in his article "What's Wrong with My UTF-8 Strings in Visual Studio?".


* It's not even really ASCII; it'll be some 8-bit encoding decided by your system, most likely the code page Windows-1252 given the platform. ASCII only defines the lower 7 bits. Historically, the various 8-bit code pages have been colloquially (if incorrectly) called "extended ASCII" in various settings. But the point is that the multi-byte nature of the data is not at all considered by the component rendering the string to your screen, let alone specifically its UTF-8-ness.

Lightness Races in Orbit
  • ASCII doesn't have any accented letters. Please don't tell me this is "the extended ASCII" either, there's no such thing. This is most likely CP1252. – n. m. could be an AI Sep 19 '19 at 16:38
  • @n.m. Well, granted - I have added a caveat – Lightness Races in Orbit Sep 20 '19 at 10:13
  • @LightnessRacesinOrbit I really appreciate your answer and your time. I tried this method in the VS Watch window and you were correct: I saw the right text. But then I called boost::filesystem::create_directories(str) in both my project and my sandbox project, and the folders that were created had different names. In my sandbox project it was the correct folder name with accents, but in my project the folder had this name: árvĂ­ztűrĹ‘tĂĽkörfĂşrĂłgĂ©p. I suppose there is still an encoding issue, but I cannot wrap my head around it. – Csuszmusz Sep 20 '19 at 10:38
  • @Csuszmusz Again you need to provide the encoding appropriate to the task. You are passing UTF-8 to `boost::filesystem::create_directories()` on Windows; this converts the arg to `boost::filesystem::path` (because that's what the function takes), and you already know that such a path should be a `wstring` holding UTF-16. You'll have to re-encode it back again! (There _is_ [a feature to do this autonomously](https://www.boost.org/doc/libs/1_69_0/libs/filesystem/doc/reference.html#path-Encoding-conversions) but I think you have to play with locales to make it work the way you want) – Lightness Races in Orbit Sep 20 '19 at 10:42
  • Part1 @LightnessRacesinOrbit To explain why I need this: I am using boost ipc and overriding the get_shared_dir(std::string &shared_dir); function, which takes a string. If I call create_directories with a boost::path as the parameter, the directory name is correct, but the function gives back the string containing árvĂ­ztűrĹ‘tĂĽkörfĂşrĂłgĂ©p. Boost uses that string inside the library, where I cannot explicitly re-encode it, and it looks for a folder with the UTF-8 name, which does not exist. So it looks like the root cause is indeed the locales. – Csuszmusz Sep 20 '19 at 11:06
  • Part2 @LightnessRacesinOrbit ... which led me to this problem: [locale setting issue](https://stackoverflow.com/questions/58013410/stdlocale-throws-runtime-error-exception-to-en-us-utf-8-locale). Also, thanks again for your time and support. – Csuszmusz Sep 20 '19 at 11:07
  • You keep linking to that page, and I don't know why because it does not seem to have anything to do with this question. This is not fundamentally about locales; it's about you providing the wrong input to library functions. It's simple: if a function expects a string with UTF-8 in it, provide that. If a function expects a wstring with UTF-16 in it, provide that. Convert between them as needed (but try to do so as little as possible because this takes time!) – Lightness Races in Orbit Sep 20 '19 at 11:11