
So, I've been trying to do a bit of research on strings and wstrings, as I need to understand how they work for a program I'm creating, so I also looked into ASCII, Unicode, UTF-8 and UTF-16.

I believe I have an okay understanding of the concept of how these work, but what I'm still having trouble with is how they are actually stored in `char`s, `string`s, `wchar_t`s and `wstring`s.

So my questions are as follows:

  1. Which character set and encoding is used for `char` and `wchar_t`? And are these types limited to using only these character sets / encodings?
  2. If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular `char` or `wchar_t`? Is it decided automatically at compile time, for example, or do we have to explicitly tell it what to use?
  3. From my understanding, UTF-8 uses 1 byte for the first 128 code points in the set but can use more than 1 byte for code point 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (`char` or `wchar_t` or whatever) know how many bytes it is using?
  4. Finally, if my understanding is correct, I get why UTF-8 and UTF-16 are not compatible, e.g. a `string` can't be used where a `wstring` is needed. But in a program that requires a `wstring`, would it be better practice to write a conversion function from `string` to `wstring` and then use it wherever a `wstring` is required, keeping my code exclusively string-based, or just use `wstring` where needed instead?

Thanks, and let me know if any of my questions are incorrectly worded or use the wrong terminology, as I'm trying to get to grips with this as best I can.

I'm working in C++, btw.

Luke Bourne
    `char` and `wchar_t` are just numbers, character set / encoding is how you interpret the numbers. – alain Feb 11 '16 at 11:33
    You will find lots of information here: [std::wstring VS std::string](http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring?rq=1) – Bo Persson Feb 11 '16 at 11:44

4 Answers

  1. They use whatever character set and encoding you want. The types do not imply a specific character set or encoding. They do not even imply characters - you could happily do math problems with them (see the short sketch after this list). Don't do that though, it's weird.

  2. How do you output text? If it is to a console, the console decides which character is associated with each value. If it is some graphical toolkit, the toolkit decides. Consoles and toolkits tend to conform to standards, so there is a good chance they will be using Unicode nowadays. On older systems anything might happen.

  3. UTF-8 has the same values as ASCII for the range 0-127. Above that it gets a bit more complicated; this is explained quite well here: https://en.wikipedia.org/wiki/UTF-8#Description

  4. wstring is a string made up of wchar_t, but sadly wchar_t is implemented differently on different platforms. For example, in Visual Studio it is 16 bits (and could be used to store UTF-16), but in GCC it is 32 bits (and could thus be used to store Unicode code points directly). You need to be aware of this if you want your code to be portable. Personally I chose to only store strings in UTF-8, and convert only when needed.
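
A minimal sketch of points 1 and 2 above (my own illustration, not from the answer): a `char` is just a small integer, and whatever reads the bytes decides what glyphs they stand for.

```cpp
#include <cstdio>

int main() {
    const char text[] = "Hi!";   // three bytes plus a terminating '\0'

    // Print each byte as a plain number - nothing about `char` says "letter".
    for (const char* p = text; *p != '\0'; ++p) {
        std::printf("byte value: %u\n",
                    static_cast<unsigned>(static_cast<unsigned char>(*p)));
    }

    // Arithmetic works because a char is just a small integer.
    char c = 'A';
    c += 1;
    std::printf("'A' + 1 = %c (%d)\n", c, c);   // B and 66 on ASCII-based systems
    return 0;
}
```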

H. Guijt

> Which character set and encoding is used for `char` and `wchar_t`? And are these types limited to using only these character sets / encodings?

This is not defined by the language standard. Each compiler will have to agree with the operating system on what character codes to use. We don't even know how many bits are used for char and wchar_t.

On some systems char is UTF-8, on others it is ASCII, or something else. On IBM mainframes it can be EBCDIC, a character encoding already in use before ASCII was defined.

> If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular `char` or `wchar_t`? Is it decided automatically at compile time, for example, or do we have to explicitly tell it what to use?

The compiler knows what is appropriate for each system.

> From my understanding, UTF-8 uses 1 byte for the first 128 code points in the set but can use more than 1 byte for code point 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (`char` or `wchar_t` or whatever) know how many bytes it is using?

The first part of UTF-8 is identical to the corresponding ASCII codes, and stored as a single byte. Higher codes will use two or more bytes.

The char type itself just stores bytes and doesn't know how many bytes we need to form a character. That's for someone else to decide.
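
A small sketch of that point, assuming the source file is saved as UTF-8: a `std::string` just holds bytes, so a character that needs more than one UTF-8 byte makes `size()` larger than the number of characters you would count by eye.

```cpp
#include <iostream>
#include <string>

int main() {
    std::string ascii    = "abc";    // 3 characters, 3 bytes
    std::string accented = "café";   // 4 characters, but 'é' takes 2 bytes in UTF-8

    std::cout << "ascii.size()    = " << ascii.size()    << '\n';  // 3
    std::cout << "accented.size() = " << accented.size() << '\n';  // 5 - size() counts
                                                                   // bytes, not characters
    return 0;
}
```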

The same goes for wchar_t, which is 16 bits on Windows but 32 bits on other systems, like Linux.
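
A quick way to see that difference on whatever machine you compile on:

```cpp
#include <climits>
#include <iostream>

int main() {
    std::cout << "wchar_t is " << sizeof(wchar_t) * CHAR_BIT << " bits here\n";
    // Typically prints 16 on Windows (MSVC) and 32 on Linux (GCC/Clang).
    return 0;
}
```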

> Finally, if my understanding is correct, I get why UTF-8 and UTF-16 are not compatible, e.g. a `string` can't be used where a `wstring` is needed. But in a program that requires a `wstring`, would it be better practice to write a conversion function from `string` to `wstring` and then use it wherever a `wstring` is required, keeping my code exclusively string-based, or just use `wstring` where needed instead?

You will likely have to convert. Unfortunately the conversion needed will be different for different systems, as character sizes and encodings vary.

In later C++ standards you have new types char16_t and char32_t, with the string types u16string and u32string. Those have known sizes and encodings.
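
A minimal sketch of those fixed-size types (C++11 and later), as I understand them: `u""` literals give UTF-16 code units in `char16_t`, and `U""` literals give whole Unicode code points in `char32_t`.

```cpp
#include <iostream>
#include <string>

int main() {
    std::u16string s16 = u"hello";   // char16_t elements: UTF-16 code units
    std::u32string s32 = U"hello";   // char32_t elements: whole Unicode code points

    std::cout << "char16_t: " << sizeof(char16_t) << " bytes, "
              << "char32_t: " << sizeof(char32_t) << " bytes\n";
    std::cout << "s16 holds " << s16.size() << " code units, "
              << "s32 holds " << s32.size() << " code points\n";
    return 0;
}
```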

Bo Persson
  • Okay, thanks for the answer! So, is there a way to check what character sets and encodings are used for different types on my system? Also, regarding your last answer, is there an example you can provide of a situation where using a u16string or u32string would be preferable over any other string? I've never used them before. – Luke Bourne Feb 11 '16 at 12:09
  • You have to check the documentation for your system. There is no standard way to tell in the program. Generally, there are a lot of other things that differ between systems, like OS calls, so the char set is just one small part. The known character types are for when your app just *needs* an exact type for some reason. I haven't used them either, yet. :-) – Bo Persson Feb 11 '16 at 12:38

Everything about the encoding used is implementation-defined. Check your compiler documentation. It depends on the default locale, the encoding of the source file and the OS console settings.

Types like string and wstring, operations on them, and C facilities like strcmp/wcscmp expect fixed-width encodings. So they would not work properly with variable-width ones like UTF-8 or UTF-16 (but will work with, e.g., UCS-2). If you want to store variable-width encoded strings, you need to be careful and not use fixed-width operations on them. The C standard library does have some functions for manipulating such strings. You can use classes from the codecvt header to convert between different encodings for C++ strings.
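
A sketch of the codecvt route, under the assumption you are on C++11 or later (note that std::wstring_convert and std::codecvt_utf8_utf16 were later deprecated in C++17): converting a UTF-8 std::string to a UTF-16 std::u16string and back.

```cpp
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // Converts between UTF-8 (std::string) and UTF-16 (std::u16string).
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    std::string    utf8  = "caf\xc3\xa9";         // "café" written as raw UTF-8 bytes
    std::u16string utf16 = conv.from_bytes(utf8); // UTF-8  -> UTF-16
    std::string    back  = conv.to_bytes(utf16);  // UTF-16 -> UTF-8

    std::cout << "UTF-8 bytes: " << utf8.size()
              << ", UTF-16 code units: " << utf16.size()
              << ", round trip ok: " << (back == utf8) << '\n';
    return 0;
}
```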

I would avoid wstring and use the C++11 exact-width character strings: std::u16string or std::u32string.

Revolver_Ocelot

As an example, here is some info on how Windows uses these types/encodings.

  • char stores ASCII values (with code pages for non-ASCII values)
  • wchar_t stores UTF-16; note this means that some Unicode characters will use two wchar_ts

If you call a system function, e.g. puts, then the header file will actually pick either puts or _putws depending on how you've set things up (i.e. whether you are using Unicode).

So on Windows there is no direct support for UTF-8, which means that if you use char to store UTF-8 encoded strings you have to convert them to UTF-16 and call the corresponding UTF-16 system functions.
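
A hedged sketch of that conversion using the Win32 API (the helper name Utf8ToUtf16 is just illustrative, not a standard function): turn a UTF-8 std::string into a UTF-16 std::wstring so it can be passed to the wide ("W") system functions.

```cpp
#include <windows.h>
#include <cstdio>
#include <string>

// Hypothetical helper: UTF-8 std::string -> UTF-16 std::wstring via Win32.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();

    // First call: ask how many wchar_t elements the result needs.
    int needed = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                     static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring utf16(needed, L'\0');

    // Second call: perform the conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()),
                        &utf16[0], needed);
    return utf16;
}

int main()
{
    std::wstring wide = Utf8ToUtf16("caf\xc3\xa9");  // "café" as UTF-8 bytes
    _putws(wide.c_str());                            // wide console output
    return 0;
}
```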

Superfly Jon