26

I need to convert between wstring and string. I figured out that using a codecvt facet should do the trick, but it doesn't seem to work for a UTF-8 locale.

My idea is that when I read a UTF-8 encoded file into chars, one utf-8 character is read into two normal characters (which is how utf-8 works). I'd like to create this UTF-8 string from the wstring representation for a library I use in my code.

Does anybody know how to do it?

I already tried this:

  locale mylocale("cs_CZ.utf-8");
  mbstate_t mystate;

  wstring mywstring = L"čřžýáí";

  const codecvt<wchar_t,char,mbstate_t>& myfacet =
    use_facet<codecvt<wchar_t,char,mbstate_t> >(mylocale);

  codecvt<wchar_t,char,mbstate_t>::result myresult;  

  size_t length = mywstring.length();
  char* pstr= new char [length+1];

  const wchar_t* pwc;
  char* pc;

  // translate characters:
  myresult = myfacet.out (mystate,
      mywstring.c_str(), mywstring.c_str()+length+1, pwc,
      pstr, pstr+length+1, pc);

  if ( myresult == codecvt<wchar_t,char,mbstate_t>::ok )
   cout << "Translation successful: " << pstr << endl;
  else cout << "failed" << endl;
  return 0;

which prints 'failed' for the cs_CZ.utf-8 locale and works correctly for the cs_CZ.iso8859-2 locale.

skaffman
Trakhan

  • take a look at this link: http://www.boost.org/doc/libs/1_42_0/libs/serialization/doc/codecvt.html might be of some help – smerlin Dec 05 '10 at 13:14
  • "one utf-8 character is read into two normal characters (which is how utf-8 works)" No it's not. UTF-16 (mostly) works this way, but a UTF-8 codepoint is represented by one to 4 bytes, and a "character" can consist of multiple codepoints. – ephemient Dec 05 '10 at 14:32
  • @ephemient - yes - I know it, I just wrote it badly :) – Trakhan Dec 05 '10 at 18:06

8 Answers

94

The code below might help you :)

#include <codecvt>
#include <string>

// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.from_bytes(str);
}

// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.to_bytes(str);
}
Emerick Rogul
skyde
  • But not on linux using libstdc++. – Tom Jul 24 '14 at 05:08
  • While the above works, I strongly suggest looking into a Unicode library such as ICU or Boost.Locale. – skyde May 25 '16 at 17:58
  • It works like a charm for any `std::wstring`. Small test here: http://stackoverflow.com/a/37531136/1802974 – Victor Mezrin May 30 '16 at 17:41
  • you might also need #include , otherwise it should build fine with libc++ – Hofi May 10 '17 at 15:45
  • Dude... Spent hours trying to do this properly on Windows to give a COM port to CreateFileW... Thank you! – MadHatter Nov 03 '17 at 18:13
  • `codecvt` is deprecated as of C++17 and there is no replacement. – Alex Reinking Aug 30 '20 at 06:10
  • @AlexReinking cppreference doesn't say that `codecvt` is deprecated. While some members are deprecated, there are new ones being added (e.g. C++20 adds `char8_t`-based `std::codecvt` specialisations). https://en.cppreference.com/w/cpp/locale/codecvt – Sahil Singh Mar 19 '21 at 19:13
  • Note that this code does not work with emojis on Windows. I posted a platform-independent version [below](https://stackoverflow.com/a/76943362/758345). – Chronial Aug 21 '23 at 07:55
9

What's your platform? Note that Windows does not support UTF-8 locales, which may explain why the conversion fails.

To get this done in a platform-dependent way you can use MultiByteToWideChar/WideCharToMultiByte on Windows and iconv on Linux. You may be able to use some Boost magic to get this done in a platform-independent way, but I haven't tried it myself, so I can't comment on that option.
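On the Linux side, the iconv route might look like the sketch below. It assumes glibc, whose iconv accepts `"WCHAR_T"` as an encoding name for the platform's wchar_t; error handling is kept minimal.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert a wstring to UTF-8 via POSIX iconv. "WCHAR_T" is a glibc
// extension naming the platform's native wchar_t encoding.
std::string wstring_to_utf8_iconv(const std::wstring& in)
{
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed");

    // UTF-8 needs at most 4 bytes per code point.
    std::string out(in.size() * 4 + 1, '\0');

    char* src = const_cast<char*>(reinterpret_cast<const char*>(in.data()));
    size_t src_left = in.size() * sizeof(wchar_t);
    char* dst = &out[0];
    size_t dst_left = out.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<size_t>(-1))
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - dst_left);  // trim the unused tail
    return out;
}
```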

hillel
3

On Windows you have to use std::codecvt_utf8_utf16<wchar_t>! Otherwise your conversion will fail on Unicode code points that need two 16-bit code units, such as the emoji U+1F609.

#include <codecvt>
#include <string>

// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
    return myconv.from_bytes(str);
}

// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> myconv;
    return myconv.to_bytes(str);
}
JWiesemann
2

You can use Boost's utf_to_utf converter to get a char-based result to store in a std::string.

std::string myresult = boost::locale::conv::utf_to_utf<char>(my_wstring);
Avinash
1

The currently most upvoted answer is not platform-independent: it breaks on non-BMP characters (i.e. emojis). JWiesemann already pointed this out in their answer, but their code will only work on Windows.

So here's a correct platform-independent version:

#include <codecvt>
#include <string>
#include <type_traits>

std::string wstring_to_utf8(std::wstring const& str)
{
  std::wstring_convert<std::conditional_t<
        sizeof(wchar_t) == 4,
        std::codecvt_utf8<wchar_t>,
        std::codecvt_utf8_utf16<wchar_t>>> converter;
  return converter.to_bytes(str);
}

std::wstring utf8_to_wstring(std::string const& str)
{
  std::wstring_convert<std::conditional_t<
        sizeof(wchar_t) == 4,
        std::codecvt_utf8<wchar_t>,
        std::codecvt_utf8_utf16<wchar_t>>> converter;
  return converter.from_bytes(str);
}

On MSVC this might generate some deprecation warnings. You can disable them by wrapping the functions in

#pragma warning(push)
#pragma warning(disable : 4996)
<the two functions>
#pragma warning(pop)

See this answer to another question as to why it's ok to disable that warning.
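As a sanity check of the sizeof-based dispatch (a sketch, subject to the same deprecation caveat mentioned above), a round trip through a non-BMP code point such as U+1F609 should be lossless:

```cpp
#include <codecvt>
#include <locale>
#include <string>
#include <type_traits>

// Same converter choice as the functions above: UTF-32 <-> UTF-8 when
// wchar_t is 4 bytes (Linux), UTF-16 <-> UTF-8 when it is 2 bytes (Windows).
using wconv = std::wstring_convert<std::conditional_t<
    sizeof(wchar_t) == 4,
    std::codecvt_utf8<wchar_t>,
    std::codecvt_utf8_utf16<wchar_t>>>;

// Round-trip a wide string through UTF-8 and check nothing was lost.
inline bool roundtrip_ok(const std::wstring& ws)
{
    return wconv().from_bytes(wconv().to_bytes(ws)) == ws;
}
```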

Chronial
-1

What the locale does is give the program information about the external encoding, while assuming that the internal encoding doesn't change. If you want to output UTF-8, you need to do it from wchar_t, not from char*.

What you could do is output it as raw data (not as a string); it should then be interpreted correctly if the system's locale is UTF-8.

Plus, when using (w)cout/(w)cerr/(w)cin you need to imbue the locale on the stream.

Šimon Tóth
-2

The Lexertl library has an iterator that lets you do this:

std::string str;
str.assign(
  lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.begin()),
  lexertl::basic_utf8_out_iterator<std::wstring::const_iterator>(wstr.end()));
Frank
-11

C++ has no idea of Unicode. Use an external library such as ICU (UnicodeString class) or Qt (QString class), both support Unicode, including UTF-8.

Philipp
  • -1 not really true, C++ supports locales, which include encoding (unfortunately this is broken for UTF-8 on Windows) – Šimon Tóth Dec 05 '10 at 19:50
  • Agree. C++ doesn't _guarantee_ Unicode, or the existence of `locale("cs_CZ.utf-8")`. But if you've got a system with that locale, it better work. – MSalters Dec 06 '10 at 10:24
  • No longer true as of C++11. `char16_t` is specifically intended for UTF-16, and `char32_t` is specifically intended for UTF-32; C++14 expands on this by requiring that the `char` types be large enough to store 256 distinct values, specifically to be suitable for UTF-8. C++11 also added classes `codecvt_utf8`, `codecvt_utf16`, and `codecvt_utf8_utf16`, as well as two new specialisations of `codecvt` (`std::codecvt<char16_t, char, mbstate_t>` and `std::codecvt<char32_t, char, mbstate_t>`). So, C++ now officially supports UTF-8, UTF-16, UTF-32, UCS2, and UCS4. – Justin Time - Reinstate Monica Dec 13 '16 at 01:06
  • Out of those `codecvt`s: `codecvt_utf8` converts between UTF-8 and UCS2/UCS4, `codecvt_utf16` converts between UTF-16 and UCS2/UCS4, `codecvt_utf8_utf16` converts between UTF-8 and UTF-16, `codecvt`'s `char16_t` specialisation is also for UTF-8 and UTF-16, and `codecvt`'s `char32_t` specialisation converts between UTF-8 and UTF-32. Not 100% sure of exactly how they work yet, I actually just started learning Unicode conversion today. – Justin Time - Reinstate Monica Dec 13 '16 at 01:09