Yes, there are a number of serious conflicts and problems with the C++ conflation of rôles for char, but the question also conflates a few things. So a simple direct answer would be like answering “yes”, “no” or “don’t know” to the question “have you stopped beating your wife?”. The only direct answer is the Buddhist “mu”, unasking the question.
Let's therefore start with a look at the facts.
Facts about the char type.
The number of bits per char is given by the implementation-defined CHAR_BIT from the <limits.h> header. This number is guaranteed to be 8 or larger. With C++03 and earlier that guarantee came from the specification of that symbol in the C89 standard, which the C++ standard noted (in a non-normative section, but still) as “incorporated”. With C++11 and later the C++ standard explicitly, on its own, gives the ≥8 guarantee. On most platforms CHAR_BIT is 8, but on some probably still extant Texas Instruments digital signal processors it’s 16, and other values have been used.
Regardless of the value of CHAR_BIT, sizeof(char) is by definition 1, i.e. it's not implementation-defined:
C++11 §5.3.3/1 (in [expr.sizeof]):
” sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.
That is, char and its variants are the fundamental unit of addressing of memory, which is the primary meaning of byte, both in common speech and formally in C++:
C++11 §1.7/1 (in [intro.memory]):
” The fundamental storage unit in the C++ memory model is the byte.
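To make these two guarantees concrete, here is a minimal sketch of my own (not from the question or the standard) that checks them at compile time and reports the implementation’s values:

    #include <climits>      // CHAR_BIT (the C++ counterpart of <limits.h>)
    #include <iostream>

    // sizeof(char) == 1 is guaranteed by the standard, so this always holds:
    static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

    // CHAR_BIT is implementation-defined, but guaranteed to be at least 8:
    static_assert(CHAR_BIT >= 8, "CHAR_BIT is guaranteed to be >= 8");

    int main()
    {
        std::cout << "CHAR_BIT = " << CHAR_BIT << "\n";          // 8 on most platforms, 16 on some TI DSPs.
        std::cout << "sizeof(char) = " << sizeof(char) << "\n";  // Always 1.
    }

On an ordinary desktop platform this reports CHAR_BIT = 8; on the TI DSPs mentioned above it would report 16, and the program would still be conforming.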
This means that on the aforementioned TI DSPs there is no C++ way of obtaining pointers to individual octets (8-bit parts). And that in turn means that code that needs to deal with endianness, or in other ways needs to treat char values as sequences of octets, in particular for network communications, needs to do things with char values that are not meaningful on a system where CHAR_BIT is 8. It also means that ordinary C++ narrow string literals, if they adhere to the standard, and if the platform's standard software uses an 8-bit character encoding, will waste memory.
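As a sketch of what such octet-level code can look like (my own illustration; the function name store_big_endian is made up for the example), serialization to network byte order has to be done arithmetically, by shifting and masking, rather than by pointing a char* at individual octets, so that it works regardless of CHAR_BIT:

    #include <cstdint>
    #include <iostream>

    // Serialize a 32-bit value as 4 octets, most significant octet first (network
    // byte order). Each octet goes into the low 8 bits of one char-sized unit; on a
    // CHAR_BIT == 16 platform the upper bits of each unit are simply left as zero.
    void store_big_endian(std::uint32_t value, unsigned char* out)
    {
        out[0] = static_cast<unsigned char>((value >> 24) & 0xFFu);
        out[1] = static_cast<unsigned char>((value >> 16) & 0xFFu);
        out[2] = static_cast<unsigned char>((value >>  8) & 0xFFu);
        out[3] = static_cast<unsigned char>( value        & 0xFFu);
    }

    int main()
    {
        unsigned char buffer[4];
        store_big_endian(0x12345678u, buffer);
        for (unsigned char octet : buffer)
        {
            std::cout << static_cast<unsigned>(octet) << " ";   // 18 52 86 120
        }
        std::cout << "\n";
    }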
The waste aspect was (or is) directly addressed in the Pascal language, which differentiates between packed strings (multiple octets per byte) and unpacked strings (one octet per byte), where the former is used for passive text storage, and the latter is used for efficient processing.
This illustrates the basic conflation of three aspects in the single C++ type char:
- unit of memory addressing, a.k.a. byte,
- smallest basic type (an octet type would be nice), and
- character encoding value unit.
And yes, this is a conflict.
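As a short sketch of my own, the same program can exercise all three rôles of char at once:

    #include <iostream>

    int main()
    {
        double x = 3.14;

        // Rôle 1, unit of memory addressing (byte): inspecting an object representation.
        unsigned char const* bytes = reinterpret_cast<unsigned char const*>(&x);
        std::cout << "First byte of x: " << static_cast<int>(bytes[0]) << "\n";

        // Rôle 2, smallest basic type: a small number squeezed into a single byte.
        signed char small_number = 42;
        std::cout << "Small number: " << static_cast<int>(small_number) << "\n";

        // Rôle 3, character encoding value unit: an element of a narrow text string.
        char const* text = "Hello";
        std::cout << "First character: " << text[0] << "\n";
    }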
Facts about UTF-16 encoding.
Unicode is a large set of 21-bit code points, most of which constitute characters on their own, but some of which are combined with others to form characters. E.g. an accented character like “é” can be formed by combining code points for “e” and “´”-as-accent. And since that’s a general mechanism it means that a Unicode character can consist of an arbitrary number of code points, although it’s usually just 1.
UTF-16 encoding was originally a compatibility scheme for code based on original Unicode’s 16 bits per code point, when Unicode was extended to 21 bits per code point. The basic scheme is that code points in the defined ranges of original Unicode are represented as themselves, while each new Unicode code point is represented as a surrogate pair of 16-bit values. A small range of original Unicode is used for surrogate pair values.
At the time, examples of software based on 16 bits per code point included 32-bit Windows and the Java language.
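As an illustration of the scheme (a sketch of my own using the standard UTF-16 conversion formulas; the function name to_utf16 is made up for the example), encoding a single code point looks roughly like this:

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Encode a single Unicode code point (0 through 0x10FFFF, and not itself in the
    // surrogate range) as one or two 16-bit UTF-16 encoding values.
    std::vector<std::uint16_t> to_utf16(std::uint32_t code_point)
    {
        if (code_point < 0x10000)
        {
            // In the range of original 16-bit Unicode: represented as itself.
            return { static_cast<std::uint16_t>(code_point) };
        }
        // Above that range: represented as a surrogate pair.
        std::uint32_t const v    = code_point - 0x10000;                             // 20 bits.
        std::uint16_t const high = static_cast<std::uint16_t>(0xD800 + (v >> 10));   // High surrogate.
        std::uint16_t const low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)); // Low surrogate.
        return { high, low };
    }

    int main()
    {
        // U+1D11E, a code point above 0xFFFF, becomes the pair D834 DD1E.
        for (std::uint16_t unit : to_utf16(0x1D11E))
        {
            std::cout << std::hex << unit << " ";
        }
        std::cout << "\n";
    }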
On a system with an 8-bit byte, UTF-16 is an example of a wide text encoding, i.e. one with an encoding unit wider than the basic addressable unit. Byte-oriented text encodings are then known as narrow text encodings. On such a system C++ char fits the latter, but not the former.
In C++03 the only built-in type suitable for the wide text encoding unit was wchar_t.
However, the C++ standard effectively requires wchar_t to be suitable for a code point, which for modern 21-bits-per-code-point Unicode means that it needs to be 32 bits. Thus there is no dedicated C++03 type that fits the requirements of UTF-16 encoding values, 16 bits per value. For historical reasons the most prevalent system that uses UTF-16 as its wide text encoding, namely Microsoft Windows, defines wchar_t as 16 bits, which after the extension of Unicode has been in flagrant contradiction with the standard; but then, the standard is impractical regarding this issue. Some platforms define wchar_t as 32 bits.
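A small sketch of my own that reports a given platform’s choice (16 bits on Windows, typically 32 bits on Linux and most other Unix-like systems):

    #include <climits>
    #include <iostream>

    int main()
    {
        // The width of wchar_t is implementation-defined: typically 16 bits on
        // Windows (UTF-16 encoding values) and 32 bits elsewhere (full code points).
        std::cout << "wchar_t is " << sizeof(wchar_t)*CHAR_BIT << " bits\n";
    }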
C++11 introduced the new types char16_t and char32_t, where the former is (designed to be) suitable for UTF-16 encoding values.
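For example (my own illustration), with C++11 one can write UTF-16 and UTF-32 string literals directly, and the surrogate pair mechanism shows up in their lengths:

    #include <iostream>

    int main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF is outside original 16-bit Unicode.
        char16_t const utf16_clef[] = u"\U0001D11E";    // Stored as a surrogate pair.
        char32_t const utf32_clef[] = U"\U0001D11E";    // Stored as a single code point.

        // Lengths exclude the terminating zero.
        std::cout << "UTF-16 units: " << sizeof(utf16_clef)/sizeof(utf16_clef[0]) - 1 << "\n";  // 2
        std::cout << "UTF-32 units: " << sizeof(utf32_clef)/sizeof(utf32_clef[0]) - 1 << "\n";  // 1
    }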
About the question.
Regarding the question’s stated assumption of
” a system with UTF-16 encoding character set
this can mean one of two things:
- a system with UTF-16 as the standard narrow encoding, or
- a system with UTF-16 as the standard wide encoding.
With UTF-16 as the standard narrow encoding, CHAR_BIT ≥ 16, and (by definition) sizeof(char) = 1. I do not know of any such system, i.e. it appears to be hypothetical. Yet it appears to be the meaning tacitly assumed in the other current answers.
With UTF-16 as the standard wide encoding, as in Windows, the situation is more complex, because the C++ standard is not up to the task. But, to use Windows as an example, one practical possibility is that sizeof(wchar_t) = 2. And one should just note that the standard is in conflict with existing practice and practical considerations for this issue, when the ideal is that standards instead standardize existing practice, where there is such.
Now finally we’re in a position to deal with the question,
” Is there a conflict between these statements above or is the sizeof(char) = 1 just a default (definition) value and will be implementation-defined depends on each system?
This is a false dichotomy. The two possibilities are not opposites. We have:
- There is indeed a conflict between char as a character encoding unit and as a memory addressing unit (byte). As noted, the Pascal language has the keyword packed to deal with one aspect of that conflict, namely storage versus processing requirements. And there is a further conflict between the formal requirements on wchar_t, and its use for UTF-16 encoding in the most widely used system that employs UTF-16 encoding, namely Windows.
- sizeof(char) = 1 by definition: it's not system-dependent.
- CHAR_BIT is implementation-defined, and is guaranteed ≥ 8.