Is there any difference between these two string storage formats?

– hkBattousai

• There's a pretty good answer to this question here: http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring/402918#402918 – Idan K Nov 22 '10 at 15:49

3 Answers

std::wstring is a container of wchar_t. The size of wchar_t is not specified—Windows compilers tend to use a 16-bit type, Unix compilers a 32-bit type.

UTF-16 is a way of encoding sequences of Unicode code points in sequences of 16-bit integers.

Using Visual Studio, if you use wide character literals (e.g. L"Hello World") that contain no characters outside of the BMP, you'll end up with UTF-16, but mostly the two concepts are unrelated. If you use characters outside the BMP, std::wstring will not translate surrogate pairs into Unicode code points for you, even if wchar_t is 16 bits.
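
To make that concrete, here is a minimal sketch (assuming a Windows/MSVC build where wchar_t is 16 bits; on a typical Linux build, wchar_t is 32 bits and each string would hold one element per code point):

```cpp
#include <iostream>
#include <string>

int main() {
    // Every character of L"Hello World" is in the BMP, so with a
    // 16-bit wchar_t each one occupies exactly one element.
    std::wstring bmp = L"Hello World";
    std::cout << bmp.size() << " elements\n";   // 11

    // U+1D11E (musical G clef) is outside the BMP. With a 16-bit
    // wchar_t it is stored as the surrogate pair 0xD834 0xDD1E, and
    // size() reports 2; std::wstring does not combine the pair back
    // into a single code point for you.
    std::wstring clef = L"\U0001D11E";
    std::cout << clef.size() << " elements\n";  // 2 on Windows
    for (wchar_t c : clef)
        std::cout << std::hex << "0x" << static_cast<unsigned>(c) << '\n';
}
```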

– JoeG

UTF-16 is a specific Unicode encoding. std::wstring is a string implementation that uses wchar_t as its underlying type for storing each character. (In contrast, regular std::string uses char).

The encoding used with wchar_t is not necessarily UTF-16; it could, for example, be UTF-32.
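
A quick way to check what your own toolchain does (the sizes in the comment are the common conventions, not guarantees):

```cpp
#include <iostream>

int main() {
    // Commonly 2 bytes on Windows (UTF-16 code units) and 4 bytes on
    // Linux/macOS (UTF-32 code points); the C++ standard fixes neither.
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
}
```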

– ThiefMaster

UTF-16 is an encoding that represents text as a sequence of 16-bit elements, although an actual character may consist of more than one element.

std::wstring is just a collection of these elements; it is a class primarily concerned with their storage.

The element type of a wstring, wchar_t, is at least 16 bits wide but could be 32 bits.
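
As a rough sketch of how the elements relate to characters (again assuming a 16-bit wchar_t as on Windows; with a 32-bit wchar_t each character below is a single element and the surrogate branch is never taken), here is a loop that recombines surrogate pairs by hand. It also answers the comment below: 'A' occupies the single element 0x0041 in both a std::wstring and UTF-16.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

int main() {
    std::wstring s = L"A\U0001D11E";  // 'A' plus a character outside the BMP

    // Walk the 16-bit elements and recombine surrogate pairs into
    // code points by hand; std::wstring will not do this for us.
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(s[i]);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.size()) {
            std::uint32_t low = static_cast<std::uint32_t>(s[++i]);
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        }
        std::cout << std::hex << "U+" << cp << '\n';  // U+41, then U+1d11e
    }
}
```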

– CashCow

• Can you please explain in more detail, perhaps by giving an example? For instance, the character 'A' is stored in std::wstring as 0x0041; how is it stored in UTF-16 format? – hkBattousai Nov 22 '10 at 15:50
• 16-**byte**?? woah that's a hardcore character encoding – Inverse Nov 22 '10 at 15:51
• @Inverse: That's why everyone should just use ASCII, there wouldn't be so much grief on memory use ;) – Matthieu M. Nov 22 '10 at 16:36
• For those who may not understand the humor in the above comments, [UTF-16](https://en.wikipedia.org/wiki/UTF-16) is a 16-***bit*** Unicode encoding. Also, in UTF-16, a character that requires more than one 16-bit element is encoded via [surrogate pairs](https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF). – DavidRR Apr 27 '15 at 13:59