Is there any difference between these two string storage formats?

– hkBattousai

• There's a pretty good answer to this question here: http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring/402918#402918 – Idan K Nov 22 '10 at 15:49

3 Answers

std::wstring is a container of wchar_t. The size of wchar_t is not specified—Windows compilers tend to use a 16-bit type, Unix compilers a 32-bit type.

UTF-16 is a way of encoding sequences of Unicode code points in sequences of 16-bit integers.

Using Visual Studio, if you use wide character literals (e.g. L"Hello World") that contain no characters outside of the BMP, you'll end up with UTF-16, but mostly the two concepts are unrelated. If you use characters outside the BMP, std::wstring will not translate surrogate pairs into Unicode code points for you, even if wchar_t is 16 bits.
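
To make that concrete, here is a minimal sketch (assuming a Windows/MSVC build where wchar_t is 16 bits; on a typical Linux build, wchar_t is 32 bits and each string would hold one element per code point):

```cpp
#include <iostream>
#include <string>

int main() {
    // Every character of L"Hello World" is in the BMP, so with a
    // 16-bit wchar_t each one occupies exactly one element.
    std::wstring bmp = L"Hello World";
    std::cout << bmp.size() << " elements\n";   // 11

    // U+1D11E (musical G clef) is outside the BMP. With a 16-bit
    // wchar_t it is stored as the surrogate pair 0xD834 0xDD1E, and
    // size() reports 2; std::wstring does not combine the pair back
    // into a single code point for you.
    std::wstring clef = L"\U0001D11E";
    std::cout << clef.size() << " elements\n";  // 2 on Windows
    for (wchar_t c : clef)
        std::cout << std::hex << "0x" << static_cast<unsigned>(c) << '\n';
}
```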

– JoeG

UTF-16 is a specific Unicode encoding. std::wstring is a string implementation that uses wchar_t as its underlying type for storing each character. (In contrast, regular std::string uses char).

The encoding used with wchar_t is not necessarily UTF-16; it could, for example, be UTF-32.
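
A quick way to check what your own toolchain does (the sizes in the comment are the common conventions, not guarantees):

```cpp
#include <iostream>

int main() {
    // Commonly 2 bytes on Windows (UTF-16 code units) and 4 bytes on
    // Linux/macOS (UTF-32 code points); the C++ standard fixes neither.
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
}
```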

– ThiefMaster

UTF-16 is an encoding that represents text as a sequence of 16-bit elements, although an actual character may consist of more than one element.

std::wstring is just a collection of these elements; it is a class primarily concerned with their storage.

The element type of a wstring, wchar_t, is at least 16 bits wide but could be 32 bits.
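
As a rough sketch of how the elements relate to characters (again assuming a 16-bit wchar_t as on Windows; with a 32-bit wchar_t each character below is a single element and the surrogate branch is never taken), here is a loop that recombines surrogate pairs by hand. It also answers the comment below: 'A' occupies the single element 0x0041 in both a std::wstring and UTF-16.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

int main() {
    std::wstring s = L"A\U0001D11E";  // 'A' plus a character outside the BMP

    // Walk the 16-bit elements and recombine surrogate pairs into
    // code points by hand; std::wstring will not do this for us.
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(s[i]);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.size()) {
            std::uint32_t low = static_cast<std::uint32_t>(s[++i]);
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        }
        std::cout << std::hex << "U+" << cp << '\n';  // U+41, then U+1d11e
    }
}
```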

– CashCow

• Can you please explain in more detail, perhaps by giving an example? For instance, the character 'A' is stored in std::wstring as 0x0041; how is it stored in UTF-16 format? – hkBattousai Nov 22 '10 at 15:50
• 16-**byte**?? woah that's a hardcore character encoding – Inverse Nov 22 '10 at 15:51
• @Inverse: That's why everyone should just use ASCII, there wouldn't be so much grief on memory use ;) – Matthieu M. Nov 22 '10 at 16:36
• For those who may not understand the humor in the above comments, [UTF-16](https://en.wikipedia.org/wiki/UTF-16) is a 16-***bit*** Unicode encoding. Also, in UTF-16, a character that requires more than one 16-bit element is encoded via [surrogate pairs](https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF). – DavidRR Apr 27 '15 at 13:59