
I'm trying to convert a std::string which contains some accented characters to a std::wstring, as explained in C++ Convert string (or char*) to wstring (or wchar_t*), but my program throws a bad conversion exception.

I'm on Windows 10 and using MSVC 2022 v17.4.1, with language set to C++17.

Here is a minimal reproducible program which demonstrates the issue:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

#pragma warning( disable : 4996 ) // silence the C++17 deprecation warning for <codecvt>

int main()
{
    std::string s{ "hello ê world" };
    
    try {
        std::wstring ws = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(s);
        std::wcout << ws << "\n";
    }
    catch (const std::exception& e) {
        std::cout << e.what() << "\n";
    }
}

Any help in converting the above std::string to std::wstring is highly appreciated.

Nicol Bolas
Aamir
  • "*`std::string s{ "hello ê world" };`*" This is not necessarily encoded in UTF-8. – Nicol Bolas Dec 16 '22 at 15:16
  • If you are going to be dealing with Unicode and different text encodings I recommend using a library like ICU instead. Do note that `std::codecvt_utf8` was deprecated in C++17. – NathanOliver Dec 16 '22 at 15:16
  • If you want to test your code, I suggest you explicitly initialise your string with the correct byte sequence. There is absolutely no guarantee that a string literal in your source code is encoded in UTF-8. – john Dec 16 '22 at 15:27
  • So something like `"hello \xC3\xAA world"`, C3 AA being the correct UTF-8 byte sequence for lower case e with circumflex. – john Dec 16 '22 at 15:31
  • If it still fails with that string then ask again. – john Dec 16 '22 at 15:32
  • @john `"hello \xC3\xAA world"` does work, I get the output as `hello Ω world`. In my case I receive the string input from the user, how can I make sure that the input byte stream is encoded in UTF-8 format? – Aamir Dec 16 '22 at 15:47
  • @NicolBolas, how can I make sure that the string is encoded in UTF-8 format? – Aamir Dec 16 '22 at 15:54
  • This might be helpful to understand what happens: https://stackoverflow.com/a/67819605/1387438 Basically your encoding is inconsistent in one of the steps: source code, compilation, executable, global locale configuration. Note that MSVC prefers to use your system locale, which is usually not UTF-8 but some country-specific one-byte encoding. – Marek R Dec 16 '22 at 15:54
  • @Aamir "*how can I make sure that the input byte stream is encoded in UTF-8 format?*" - it depends on how you are receiving the input to begin with, which you did not show. – Remy Lebeau Dec 16 '22 at 16:41
  • @RemyLebeau, I use `wxTextCtrl` from wxWidgets GUI library to get input from the user. Apart from normal ASCII characters, user can enter accented characters as input, which I need to convert to `wstring`. – Aamir Dec 16 '22 at 18:04
  • @Aamir [`wxTextCtrl`](https://docs.wxwidgets.org/3.0/classwx_text_ctrl.html) provides text to you as a [`wxString`](https://docs.wxwidgets.org/3.0/classwx_string.html), which is a native Unicode string that is convertible to `std::wstring` without data loss (see [`wxString::ToStdWstring()`](https://docs.wxwidgets.org/3.0/classwx_string.html#acd4ba44e34428aa83cd9922d2933d060)). So, why are you trying to convert the text to a UTF-8 `std::string` and then to a UTF-16 `std::wstring`? Just let the text handle the UTF-16 conversion for you. – Remy Lebeau Dec 16 '22 at 19:05
  • @RemyLebeau, I don’t receive the wxString directly, wxString is passed as std::string to different C++ library before received by my function. – Aamir Dec 16 '22 at 19:27
  • @Aamir ok, well, even so, `wxString` can be converted to UTF-8 using [`wxString::utf8_str()`](https://docs.wxwidgets.org/3.0/classwx_string.html#ad71e3ded85939db8af9eeadfa02719ac), which can then be used to construct a `std::string`, so there is no reason for you to have a `std::string` that is not valid UTF-8 for `std::wstring_convert` to decode. – Remy Lebeau Dec 16 '22 at 21:42
  • "I receive the string input from the user, how can I make sure that the input byte stream in encoded in UTF-8 format?" It's a separate, completely different question. – n. m. could be an AI Dec 16 '22 at 22:24

1 Answer


You need to both build with the /utf-8 compiler flag and save your file as UTF-8.

To save a file as UTF-8 in Visual Studio, select "Save with Encoding..." from the "Save As" dialog.

Save As dialog

Your string literal is probably being stored in your system code page rather than UTF-8 (for example, `ê` becomes the single byte `\xEA` in Windows-1252), so the bytes are not valid UTF-8 and `from_bytes` throws.

From the Visual Studio documentation:

If no byte-order mark is found, it assumes that the source file is encoded in the current user code page, unless you've specified a code page by using /utf-8 or the /source-charset option.

Etienne Laurin