36

I am trying to convert a C++ std::string to UTF-8 or std::wstring without losing information (consider a string that contains non-ASCII characters).

According to http://forums.sun.com/thread.jspa?threadID=486770&forumID=31:

If the std::string has non-ASCII characters, you must provide a function that converts from your encoding to UTF-8 [...]

What encoding does std::string.c_str() use? How can I convert it to UTF-8 or std::wstring in a cross-platform fashion?

Gili

2 Answers

52

std::string per se uses no encoding -- it will return the bytes you put in it. For example, those bytes might be using ISO-8859-1 encoding... or any other, really: the information about the encoding is just not there -- you have to know where the bytes were coming from!
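
As a small illustration of the point above, here is a sketch showing that `std::string` stores whatever bytes it is given and that `size()` counts bytes, not characters. The helper `utf8_code_points` and the two constants are hypothetical names introduced for this example, and the helper assumes its input really is valid UTF-8, which `std::string` itself cannot confirm.

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumption: `s` is valid UTF-8. Nothing in std::string records that.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// The same character, U+00E9 (e with acute accent), stored two ways.
// std::string holds either byte sequence without complaint.
const std::string e_acute_latin1 = "\xE9";      // ISO-8859-1: one byte
const std::string e_acute_utf8   = "\xC3\xA9";  // UTF-8: two bytes
```

Here `e_acute_latin1.size()` is 1 and `e_acute_utf8.size()` is 2, although both represent one character. Only knowledge of the encoding lets you interpret the bytes.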

Alex Martelli
  • 1
    So essentially there is no way for me to convert std::string without knowing its encoding ahead of time? I ask because I'm writing an API function that takes in a std::string. I guess the documentation will need to instruct users what format to pass in. – Gili Jun 18 '09 at 04:49
  • 3
    @Gili, right: you cannot reliably convert a byte sequence in an unknown encoding to UTF-8 (or anything else;-). I recommend you ask the caller to supply UTF-8 data -- most other encodings don't allow encoding _every_ possible Unicode string. As @Naaff says, ASCII is a special case of UTF-8 (and ISO-8859-* and many other encodings), so if that's your case there's no worry (a footnote in the docs reminding the users of this fact might save _them_ worry;-). – Alex Martelli Jun 18 '09 at 04:59
  • 1
    ISO-8859-* are in no way a "special case" of UTF-8. They are simply different single-byte encodings. – n0rd Jun 18 '09 at 09:50
  • 2
    ASCII strings are also UTF-8 strings and ISO-8859-1 strings &c: that's why the paren comes after UTF-8 and not right after ASCII;-). – Alex Martelli Jun 18 '09 at 12:43
  • 2
    The docs back this up: *Note that this class handles bytes independently of the encoding used: If used to handle sequences of multi-byte or variable-length characters (such as UTF-8), all members of this class (such as length or size), as well as its iterators, will still operate in terms of bytes (not actual encoded characters).* http://www.cplusplus.com/reference/string/string/ – Ohad Schneider Aug 27 '14 at 11:16
  • I would like to give an example of "**you have to know where the bytes were coming from!**". If you are using a Chinese Windows system and initialize a `string` with some Chinese characters in Visual Studio, the encoding is determined by the code page, which by default is `GB2312` and depends on the **language/region** setting. – Rick Aug 03 '20 at 02:35
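
To make the comments above concrete: once the source encoding is known, conversion becomes mechanical. The following is a minimal sketch for the simplest case, ISO-8859-1, whose byte values coincide with Unicode code points U+0000 to U+00FF. The function name `latin1_to_utf8` is hypothetical, and the caller is assumed to know that the input really is ISO-8859-1. Encodings such as GB2312 need mapping tables, typically from a conversion library, and are not covered here.

```cpp
#include <string>

// Converts bytes that the caller asserts are ISO-8859-1 into UTF-8.
// Assumption: the input encoding is known; it cannot be detected here.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size() * 2);  // worst case: every byte becomes two
    for (unsigned char byte : latin1) {
        if (byte < 0x80) {
            // ASCII range: identical in ISO-8859-1 and UTF-8.
            utf8.push_back(static_cast<char>(byte));
        } else {
            // U+0080..U+00FF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
            utf8.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return utf8;
}
```

For example, the single ISO-8859-1 byte `0xE9` becomes the UTF-8 bytes `0xC3 0xA9`. Feeding this function bytes in any other encoding yields valid-looking but wrong output, which is exactly why the encoding has to be known up front.
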
7

std::string contains any sequence of bytes, so the encoding is up to you. You must know how it is encoded. However, unless you have reason to believe otherwise, it's probably just ASCII, in which case it's already valid UTF-8.

Naaff
  • 21
    I have seen "it's probably just..." be the source of so many character encoding errors. I suggest never guessing when it comes to character encodings: Always be very explicit in what you take and what you produce. In each case, if you don't spec the character set, then spec an additional parameter/return value to indicate the encoding. – MtnViewMark Aug 24 '09 at 16:17
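
One way to follow the advice in the last comment is to make the encoding an explicit parameter instead of guessing. The sketch below is only an illustration: the `Encoding` enum and the `to_utf8` function are hypothetical names, only two encodings are modelled, and input declared as UTF-8 is passed through without checking that it is well-formed.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical set of encodings this API agrees to accept.
enum class Encoding { Ascii, Utf8 };

// Hypothetical entry point: the caller must declare the encoding.
// Returns UTF-8 bytes, or throws if the data contradicts the declaration.
std::string to_utf8(const std::string& bytes, Encoding declared) {
    if (declared == Encoding::Ascii) {
        for (unsigned char byte : bytes) {
            if (byte >= 0x80) {
                throw std::invalid_argument(
                    "non-ASCII byte in input declared as ASCII");
            }
        }
    }
    // ASCII is a subset of UTF-8, so verified ASCII needs no change.
    // Declared UTF-8 is returned as-is (not validated in this sketch).
    return bytes;
}
```

A call such as `to_utf8(input, Encoding::Ascii)` then fails loudly on unexpected bytes instead of silently producing mojibake later on.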