37

Assume a program is running on a system with a UTF-16-encoded character set. According to The C++ Programming Language, 4th edition, page 150:

A char can hold a character of the machine’s character set.

→ From this I would expect a char variable to be 2 bytes in size.

But according to ISO/IEC 14882:2014:

sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.

or The C++ Programming Language, 4th edition, page 149:

"[...], so by definition the size of a char is 1"

→ So the size is fixed at 1.

Question: Is there a conflict between the statements above, or is sizeof(char) = 1 just a value fixed by definition, while the actual size of a char is implementation-defined and depends on each system?

kembedded
  • What do you mean by "a system with UTF-16 encoding"? Basically, in C++ you are responsible for your own encoding, and UTF-16 strings are stored in wchar arrays/strings. – Arne Mar 30 '15 at 03:56
  • @Arne: It is perfectly conceivable that a system could use UTF-16 as its default basic encoding. – Dolda2000 Mar 30 '15 at 03:57
  • @Dolda2000 Yes, but just because the system's default encoding is UTF-16, it doesn't mean you have to do any UTF-16 in your program. – Arne Mar 30 '15 at 04:01
  • @Arne: Of course not, but that's beside the point. The point of the question is that "The C++ Programming Language" defines `char` in terms of "the machine's character set". – Dolda2000 Mar 30 '15 at 04:05
  • Yup, I mean the same thing as Dolda2000 said. It's the default basic encoding; I mention it because I want to use sizeof(char) in some conditional statements, for example. :) – kembedded Mar 30 '15 at 04:05
  • @kembedded I am sorry to say this, but char is always one byte, never 2 (for compatibility), and it will never change. If you want to use UTF-16, wchar_t and wstring are your friends. – Arne Mar 30 '15 at 04:11
  • Historically "a character" was a common use for "a byte", and at the lowest levels they were essentially the same thing. Both the move that meant that almost (but still not quite) all computers had bytes that are octets (8 bits) and the move that meant we used more than one byte per character so we can represent real text well came later. The legacy of this is that while some later languages have a *char* type suitable for use with characters separate from a *byte* type, in C and C++ *char* is a name for bytes. – Jon Hanna Mar 30 '15 at 09:40
  • @Arne: wchar_t and wstring are absolutely not your friend when you want UTF-16, in contrast to an unspecified wide encoding, which is most likely actually the fixed-size UTF-32. – Deduplicator Mar 30 '15 at 12:50
  • @JonHanna: which other language are you referring to when saying that some other languages have a `char` type suitable for use with characters? Because if it's Java, its `char` type is actually woefully inadequate, so much so that `char`-using APIs are deprecated and substituted by `int`-based APIs (e.g: `char charAt(int index)` → `int codePointAt(int index)`). – ninjalj Mar 30 '15 at 18:49

5 Answers

35

The C++ standard (and C, for that matter) effectively define byte as the size of a char type, not as an eight-bit quantity¹. As per C++11 1.7/1 (my bold):

The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, **the number of which is implementation defined**.

Hence the expression sizeof(char) is always 1, no matter what.

If you want to see whether your baseline char variable (probably the unsigned variant would be best) can actually hold a 16-bit value, the item you want to look at is CHAR_BIT from <climits>. It holds the number of bits in a char variable.
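
As a quick check, a minimal sketch; the values in the comments are typical, implementation-dependent results, not guarantees:

    #include <climits>   // CHAR_BIT
    #include <iostream>

    int main()
    {
        // sizeof(char) is 1 by definition; CHAR_BIT tells you how many bits that one byte has.
        std::cout << "sizeof(char) = " << sizeof(char) << '\n';  // always 1
        std::cout << "CHAR_BIT     = " << CHAR_BIT << '\n';      // 8 on most platforms, 16 on some DSPs

        if (CHAR_BIT >= 16)
            std::cout << "an unsigned char can hold a 16-bit code unit\n";
    }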


¹ Many standards, especially ones related to communications protocols, use the more exact term octet for an eight-bit value.

paxdiablo
  • Thanks for the quote. It seems a bit ambiguous IMO, though. Is there a definition of what constitutes "the basic execution character set" somewhere? – Dolda2000 Mar 30 '15 at 04:06
  • @Dolda2000, `3.9.1` defines `char` in almost the same terms as byte given in my (edited) answer. The difference is the UTF-8 bit but, given C++ defines the _minimum_ size of a `char` as eight bits, it's capable of holding the UTF-8 stuff as well. – paxdiablo Mar 30 '15 at 04:09
  • Yes, the basic execution character set is defined. There are about 90 symbols in it. – M.M Mar 30 '15 at 04:11
  • Sorry, it's still a little bit confusing to me. As you said, the C and C++ standards define 'byte' as the size of a 'char', hence the expression sizeof(char) is always 1, no matter what. But I found this statement in "The C++ Programming Language - 4th": "In addition, it is guaranteed that a char has at least 8 bits, .... A char can hold a character of the machine's character set. The char type is supposed to be chosen by the implementation to be the most suitable type for holding and manipulating characters on a given computer; it is typically an 8-bit byte." – kembedded Mar 30 '15 at 04:41
  • So if, as in my assumption, each character in the machine's character set uses a 2-byte encoding, then in that case I think sizeof(char) = 2. Is that right? :( (because a char has at least 8 bits, so it can have 16 bits, ...) – kembedded Mar 30 '15 at 04:45
  • @kembedded: no, if the base character set uses 16 bits, then **a byte is sixteen bits wide.** You need to get away from thinking that a byte is 8 bits wide. In C/C++, it's not necessarily so. If you want to refer to an 8-bit value, use the term octet - a byte can be any size in C++ as long as it can contain the underlying data. If you want it to handle all of Unicode as well as every alien language in the galaxy, you could quite easily have it as 256 bits :-) – paxdiablo Mar 30 '15 at 04:49
  • @paxdiablo: yeah, thanks for all of your help. Finally, I understand my problem. I summarize my understanding like this: sizeof(char) is always 1 byte, and this byte means the size of 'char' (not an 8-bit byte). So if I use UTF-8, a 'char' is 1 character (1 code unit) => sizeof(char) = 1, and CHAR_BIT is 8 bits. And if I use UTF-16, a 'char' is 1 character (1 code unit) => sizeof(char) = 1, and CHAR_BIT is 16 bits (2 groups of 8 bits each). Hope this is right? :D – kembedded Mar 30 '15 at 05:10
  • @kembedded, that's _exactly_ right, despite your ongoing use of byte to _sometimes_ mean 8 bits :-) Still, I understood what you were getting at, I'm not sure _why_ ANSI/ISO didn't go the octet route like so many other standards. Still, too late now. – paxdiablo Mar 30 '15 at 05:12
  • @paxdiablo: The keyword is "a byte is sixteen bits wide". Thanks – kembedded Mar 30 '15 at 05:13
  • @paxdiablo: yeah, I will fix my mistake of using these terms – kembedded Mar 30 '15 at 05:14
  • @kembedded - that's almost right. I may be seeing a language issue here rather than a technical issue, but "in case I use UTF-8" should be "in case the compiler uses UTF-8". It's not a matter of what the programmer does, but of what the compiler writers decided to do. – Pete Becker Mar 30 '15 at 13:41
  • I think the problem with this wording is that it's intentionally ambiguous for reasons that are lost to history. For example, there have been computers with 9-bit bytes and others with 60-bit words and 6-bit characters (but no real concept of a "byte"). – Gabe Mar 30 '15 at 17:35
  • @paxdiablo I suspect the reason they didn't go the octet route is that there's too much code out there that does `malloc(n)` to mean `malloc(n * sizeof(char))`. It's even common to remind people who use `*sizeof(char)` that this is redundant and unnecessary. – Barmar Mar 31 '15 at 20:03
28

Yes, there are a number of serious conflicts and problems with the C++ conflation of rôles for char, but also the question conflates a few things. So a simple direct answer would be like answering “yes”, “no” or “don’t know” to the question “have you stopped beating your wife?”. The only direct answer is the buddhist “mu”, unasking the question.

So let's therefore start with a look at the facts.


Facts about the char type.

The number of bits per char is given by the implementation defined CHAR_BIT from the <limits.h> header. This number is guaranteed to be 8 or larger. With C++03 and earlier that guarantee came from the specification of that symbol in the C89 standard, which the C++ standard noted (in a non-normative section, but still) as “incorporated”. With C++11 and later the C++ standard explicitly, on its own, gives the ≥8 guarantee. On most platforms CHAR_BIT is 8, but on some probably still extant Texas Instruments digital signal processors it’s 16, and other values have been used.

Regardless of the value of CHAR_BIT the sizeof(char) is by definition 1, i.e. it's not implementation defined:

C++11 §5.3.3/1 (in [expr.sizeof]):

sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.

That is, char and its variants are the fundamental unit of addressing of memory, which is the primary meaning of byte, both in common speech and formally in C++:

C++11 §1.7/1 (in [intro.memory]):

The fundamental storage unit in the C++ memory model is the byte.

This means that on the aforementioned TI DSPs there is no C++ way of obtaining pointers to individual octets (8-bit parts). And that in turn means that code that needs to deal with endianness, or in other ways needs to treat char values as sequences of octets, in particular for network communications, needs to do things with char values that are not meaningful on a system where CHAR_BIT is 8. It also means that ordinary C++ narrow string literals, if they adhere to the standard, and if the platform's standard software uses an 8-bit character encoding, will waste memory.
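
As a hedged sketch of that last point: portable serialization code typically extracts octets arithmetically rather than through pointers into an object, and on a CHAR_BIT == 16 machine each resulting char then carries only 8 bits of useful payload (the helper name store_be32 is made up for illustration):

    #include <cstdint>

    // Store the four octets of a 32-bit value in big-endian (network) order,
    // one octet per unsigned char. This works for any CHAR_BIT >= 8, but when
    // CHAR_BIT is 16 each element uses only half of its bits.
    void store_be32(std::uint32_t value, unsigned char* out)
    {
        out[0] = static_cast<unsigned char>((value >> 24) & 0xFFu);
        out[1] = static_cast<unsigned char>((value >> 16) & 0xFFu);
        out[2] = static_cast<unsigned char>((value >> 8)  & 0xFFu);
        out[3] = static_cast<unsigned char>( value        & 0xFFu);
    }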

The waste aspect was (or is) directly addressed in the Pascal language, which differentiates between packed strings (multiple octets per byte) and unpacked strings (one octet per byte), where the former is used for passive text storage, and the latter is used for efficient processing.

This illustrates the basic conflation of three aspects in the single C++ type char:

  • unit of memory addressing, a.k.a. byte,

  • smallest basic type (it would be nice to have a separate octet type), and

  • character encoding value unit.

And yes, this is a conflict.


Facts about UTF-16 encoding.

Unicode is a large set of 21-bit code points, most of which constitute characters on their own, but some of which are combined with others to form characters. E.g. a character with accent like “é” can be formed by combining code points for “e” and “´”-as-accent. And since that’s a general mechanism it means that a Unicode character can be an arbitrary number of code points, although it’s usually just 1.

UTF-16 encoding was originally a compatibility scheme for code based on original Unicode’s 16 bits per code point, when Unicode was extended to 21 bits per code point. The basic scheme is that code points in the defined ranges of original Unicode are represented as themselves, while each new Unicode code point is represented as a surrogate pair of 16-bit values. A small range of original Unicode is used for surrogate pair values.
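
A sketch of that surrogate-pair construction, assuming the code point is in the supplementary range U+10000 to U+10FFFF (the helper name is made up for illustration):

    #include <cstdint>
    #include <utility>

    // Split a supplementary-plane code point into a UTF-16 surrogate pair.
    // Precondition (assumed, not checked): 0x10000 <= cp <= 0x10FFFF.
    std::pair<char16_t, char16_t> to_surrogates(char32_t cp)
    {
        const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000u;   // 20-bit value
        const char16_t high = static_cast<char16_t>(0xD800u + (v >> 10));    // lead surrogate
        const char16_t low  = static_cast<char16_t>(0xDC00u + (v & 0x3FFu)); // trail surrogate
        return {high, low};
    }

For instance, U+1F600 comes out as the pair 0xD83D, 0xDE00.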

At the time, examples of software based on 16 bits per code point included 32-bit Windows and the Java language.

On a system with an 8-bit byte, UTF-16 is an example of a wide text encoding, i.e. one with an encoding unit wider than the basic addressable unit. Byte-oriented text encodings are then known as narrow text encodings. On such a system C++ char fits the latter, but not the former.

In C++03 the only built-in type suitable for the wide text encoding unit was wchar_t.

However, the C++ standard effectively requires wchar_t to be suitable for a code point, which for modern 21-bits-per-code-point Unicode means that it needs to be 32 bits. Thus there is no dedicated C++03 type that fits the requirements of UTF-16 encoding values, 16 bits per value. For historical reasons the most prevalent system that uses UTF-16 as its wide text encoding, namely Microsoft Windows, defines wchar_t as 16 bits, which since the extension of Unicode has been in flagrant contradiction with the standard; but then, the standard is impractical regarding this issue. Some platforms define wchar_t as 32 bits.

C++11 introduced new types char16_t and char32_t, where the former is (designed to be) suitable for UTF-16 encoding values.
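
For instance, a minimal C++11 sketch (the literal content is arbitrary; u"" literals are UTF-16 encoded with one char16_t per code unit):

    #include <iostream>

    int main()
    {
        // The supplementary-plane character U+1F600 takes two code units (a surrogate pair).
        const char16_t text[] = u"h\u00E9llo \U0001F600";
        std::cout << "code units, including the terminator: "
                  << sizeof(text) / sizeof(text[0]) << '\n';   // 9 on a conforming implementation
    }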


About the question.

Regarding the question’s stated assumption of

"a system with a UTF-16-encoded character set"

this can mean one of two things:

  • a system with UTF-16 as the standard narrow encoding, or
  • a system with UTF-16 as the standard wide encoding.

With UTF-16 as the standard narrow encoding, CHAR_BIT ≥ 16 and (by definition) sizeof(char) = 1. I do not know of any such system; it appears to be hypothetical. Yet it appears to be the meaning tacitly assumed in the other current answers.

With UTF-16 as the standard wide encoding, as in Windows, the situation is more complex, because the C++ standard is not up to the task. But, to use Windows as an example, one practical possibility is that sizeof(wchar_t) = 2. One should just note that the standard is in conflict with existing practice and practical considerations on this issue, whereas the ideal is that standards codify existing practice where such practice exists.
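
A tiny sketch to see where a given implementation stands; the values noted in the comments are common conventions, not guarantees:

    #include <iostream>

    int main()
    {
        std::cout << "sizeof(wchar_t)  = " << sizeof(wchar_t)  << '\n';  // typically 2 on Windows, 4 on most Unix-like systems
        std::cout << "sizeof(char16_t) = " << sizeof(char16_t) << '\n';  // C++11 type for UTF-16 code units
        std::cout << "sizeof(char32_t) = " << sizeof(char32_t) << '\n';  // C++11 type for code points
    }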

Now finally we’re in a position to deal with the question,

Is there a conflict between the statements above, or is sizeof(char) = 1 just a value fixed by definition, while the actual size of a char is implementation-defined and depends on each system?

This is a false dichotomy. The two possibilities are not opposites. We have

  • There is indeed a conflict between char as character encoding unit and as a memory addressing unit (byte). As noted, the Pascal language has the keyword packed to deal with one aspect of that conflict, namely storage versus processing requirements. And there is a further conflict between the formal requirements on wchar_t, and its use for UTF-16 encoding in the most widely used system that employs UTF-16 encoding, namely Windows.

  • sizeof(char) = 1 by definition: it's not system-dependent.

  • CHAR_BIT is implementation defined, and is guaranteed ≥ 8.

Cheers and hth. - Alf
  • Although it is fairly difficult to understand all of it completely, I think I almost see your point of view and your explanation. Thanks for the answer and for sharing your experience. It's really helpful and makes the connection between memory storage and character encoding clearer to me, as well as the conflict you mentioned. – kembedded Mar 30 '15 at 08:20
  • Although difficult to understand at first glance, this is a really outstanding answer. – kembedded Mar 30 '15 at 14:16
  • Assuming a system with a minimal addressable unit of 8 bits and where UTF-16 is the one and only character set, paxdiablo's answer implies that `CHAR_BIT` is 16, `sizeof(char)` is 1, `sizeof(uint32_t)` is 2, and you can fit a native character into a `char`. Yours implies that `CHAR_BIT` is 8, `sizeof(char)` is 1, `sizeof(uint32_t)` is 4, and a native character only fits into a `wchar_t` or larger. Which is correct? – Mark Mar 30 '15 at 23:27
  • @Mark: With a "minimal addressable unit of 8 bits" you have `CHAR_BIT` = 8. The only way to make sense of "UTF-16 as the only character [encoding]" for such a system is that `wchar_t` or `char16_t` or some such type is used in C++, and all literals are wide (e.g. `L"Hello"`); an 8-bit `char` is not sufficient as an encoding unit for UTF-16. [Pax Diablo's answer](http://stackoverflow.com/a/29338173/464581) simply does not discuss this particular case, as far as I can see, and neither do I. Because those are conflicting requirements that won't occur in practice. – Cheers and hth. - Alf Mar 31 '15 at 04:31
  • @kembedded: Mark poses conflicting requirements. You resolved it by adjusting `CHAR_BIT` so that the C++ memory unit is larger than the hardware memory unit. In theory that is a valid way to resolve it, but in practice a C++ implementation will always support the hardware first of all, and only then make compromises about conventions such as character encodings. And that means that if those conflicting requirements were to occur for real, then a C++ implementation would just not support narrow text literals. It has to not support *something*. – Cheers and hth. - Alf Mar 31 '15 at 05:24
  • @Cheersandhth.-Alf: ah, I see. So my remaining ambiguity is cleared up. The value of **CHAR_BIT** is the actual size of **char** (in this case, **char** is a unit of memory addressing). Also, the value of **CHAR_BIT** is determined, with highest priority, by the hardware configuration. Right? :) – kembedded Mar 31 '15 at 05:37
  • @kembedded: Yes. Because software requirements are more ... soft. ;-) A compiler that doesn't support the hardware is inherently limited, but when the hardware support is there, then logical constraints on its use (such as the unrealistic one of not supporting any other encoding than UTF-16) can be easily added, e.g. as a special option. – Cheers and hth. - Alf Mar 31 '15 at 05:56
9

No, there's no conflict. These two statements refer to different definitions of byte.

UTF-16 implies that byte is the same thing as octet - a group of 8 bits.

In the C++ language, a byte is the same thing as a char. There's no limitation on how many bits a C++ byte can contain. The number of bits in a C++ byte is given by the CHAR_BIT macro constant.

If your C++ implementation decides to use 16 bits to represent each character, then CHAR_BIT will be 16 and each C++ byte will occupy two UTF-16 bytes (octets). sizeof(char) will still be 1, and the sizes of all objects will be measured in terms of 16-bit bytes.
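
A hedged sketch of what such a hypothetical CHAR_BIT == 16 implementation could report; the values in the comments follow from that scenario, while a common 8-bit-byte platform would print 8, 1 and 4 instead:

    #include <climits>
    #include <cstdint>
    #include <iostream>

    int main()
    {
        std::cout << CHAR_BIT << '\n';              // 16 on such an implementation
        std::cout << sizeof(char) << '\n';          // 1, by definition
        std::cout << sizeof(std::uint32_t) << '\n'; // 2, i.e. two 16-bit bytes (if uint32_t is provided)
    }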

AnT stands with Russia
  • Ah, finally, I see. Thanks for your answer. It is helpful and makes me understand the problem. sizeof(char) is always 1 byte, and this byte means the size of 'char' (not an 8-bit byte). So if I use UTF-8, a 'char' is 1 character (1 code unit) => sizeof(char) = 1, and CHAR_BIT is 8 bits. And if I use UTF-16, a 'char' is 1 character (1 code unit) => sizeof(char) = 1, and CHAR_BIT is 16 bits (2 groups of 8 bits each). – kembedded Mar 30 '15 at 05:08
  • Re "No, there's no conflict", there are a number of conflicts, as I've now laid out in a separate answer, and re "These two statements", well the OP cites three statements. So, that's just wrong. – Cheers and hth. - Alf Mar 30 '15 at 06:06
7

A char is defined as being 1 byte. A byte is the smallest addressable unit. This is 8 bits on common systems, but on some systems it is 16 bits, or 32 bits, or anything else (but must be at least 8 for C++).

It is somewhat confusing because in popular jargon byte is used for what is technically known as an octet (8 bits).

So, your second and third quotes are correct. The first quote is, strictly speaking, not correct.

As defined by [intro.memory]/1 in the C++ Standard, a char only needs to be able to hold the basic execution character set, which is approximately 100 characters (all of which appear in the 0 - 127 range of ASCII), and the octets that make up the UTF-8 encoding. Perhaps that is what the author meant by machine character set.


On a system where the hardware is octet addressable but the character set is Unicode, it is likely that char will remain 8-bit. However there are types char16_t and char32_t (added in C++11) which are designed to be used in your code instead of char for systems that have 16-bit or 32-bit character sets.

So, if the system goes with char16_t then you would use std::basic_string<char16_t> instead of std::string, and so on.

Exactly how UTF-16 should be handled will depend on the detail of the implementation chosen by the system. Unicode is a 21-bit character set and UTF-16 is a multibyte encoding of it; so the system could go the Windows-like route and use std::basic_string<char16_t> with UTF-16 encoding for strings; or it could go for std::basic_string<char32_t> with raw Unicode code points as the characters.
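
For instance, a sketch of the Windows-like route using std::u16string, which is just std::basic_string<char16_t>:

    #include <iostream>
    #include <string>

    int main()
    {
        std::u16string s = u"UTF-16 text \U0001F600";
        // length() counts char16_t code units, not characters: the
        // supplementary-plane character above contributes a surrogate pair.
        std::cout << s.length() << '\n';   // 14: 12 BMP characters + 2 surrogate code units
    }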

Alf's post goes into more detail on some of the issues that can arise.

M.M
  • As for the distinction between "character set" and "encoding", it should be said that historically the two have been quite conflated, and it is not entirely clear what might being referred to. Just think of the common `Content-Type: n/n; charset=` MIME header, for instance. – Dolda2000 Mar 30 '15 at 04:09
  • Re "There is no conflict", there are a number of conflicts, as I've now laid out in a separate answer. So, that's just wrong. – Cheers and hth. - Alf Mar 30 '15 at 06:03
  • @Cheersandhth.-Alf OP was asking about conflicts between the three quotes in his post. His last two quotes are correct and consistent with each other. The non-Standard quote "A char can hold a character of the machine’s character set." may appear to conflict with "A char is 1 byte", however I explain that there is no problem because (i) 1 byte may be large enough to hold any character in the set, and/or (ii) a char is not actually required to hold every character in the "machine's character set" , it only needs to hold the characters in the basic execution set and UTF-8. ([intro.memory]/1) – M.M Mar 30 '15 at 06:11
  • @MattMcNabb: Fair enough, you're talking about **formal conflicts**. Still it's wrong to just say that there is "no conflict", when there are lots of very obvious and very well known practical conflicts. I mention in particular memory waste (so important that it influenced the design of Pascal) and the need for special case code for non-octet-byte systems, i.e. no uniform treatment. I think instead of "no conflict" you should write "no formal conflict". To be clear. – Cheers and hth. - Alf Mar 30 '15 at 06:24
  • @Cheersandhth.-Alf rewrote my post. Some things would depend on this implementation that OP hasn't specified in detail – M.M Mar 30 '15 at 06:35
5

Without quoting the standard, it is easy to give a simple answer:

A byte is not defined as 8 bits. A byte is the smallest addressable unit of memory, whatever its size. Most commonly it is 8 bits, but there is no reason a byte couldn't be 16 bits.

The C++ standard adds one restriction: a byte must be at least 8 bits.

So there is no problem with sizeof(char) always being 1, no matter what. Sometimes that one byte will be 8 bits, sometimes 16 bits, and so on.

senfen