What assumption is safe for a C++ implementation's character set?

Question

In The C++ Programming Language 6.2.3, it says:

It is safe to assume that the implementation character set includes the decimal digits, the 26 alphabetic characters of English, and some of the basic punctuation characters. It is not safe to assume that:

There are no more than 127 characters in an 8-bit character set (e.g., some sets provide 255 characters).

There are no more alphabetic characters than English provides (most European languages provide more, e.g., æ, þ, and ß).

The alphabetic characters are contiguous (EBCDIC leaves a gap between 'i' and 'j').

Every character used to write C++ is available (e.g., some national character sets do not provide {, }, [, ], |, and \).

A char fits in 1 byte. There are embedded processors without byte accessing hardware for which a char is 4 bytes. Also, one could reasonably use a 16-bit Unicode encoding for the basic chars.

I'm not sure I understand the last two statements.

In section 2.3 of the standard, it says:

The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ! = , \ " '
...

The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits.

The standard states that characters like { } [ ] | \ are part of the basic execution character set. Then why does TC++PL say it's not safe to assume that those characters are available in the implementation's character set?

And for the size of a char, in section 5.3.3 of the standard:

The sizeof operator yields the number of bytes in the object representation of its operand. ... ... sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.

The standard states that a char is 1 byte. What point is TC++PL trying to make here?

Regarding the size of `char`, it's always `1`, but that doesn't have to mean it's one *byte*. – Some programmer dude Jan 26 '14 at 14:40
And it's *not* safe to assume the digits or character sequences in the alphabet are contiguous. As you say yourself, EBCDIC leaves a gap in the encoding, but it's still a valid source encoding. (IBM mainframes still use EBCDIC, if I remember correctly.) – Some programmer dude Jan 26 '14 at 14:43
Thanks for your reply. The standard says "The sizeof operator yields the number of bytes in the object representation of its operand." Then why doesn't sizeof(char)==1 mean it's one byte? @joachim-pileborg – goodbyeera Jan 26 '14 at 14:44
The `char` type is a special case. You might want to read e.g. [this old SO question](http://stackoverflow.com/questions/2098149/what-platforms-have-something-other-than-8-bit-char). – Some programmer dude Jan 26 '14 at 14:47
About `{`, `}`, `[`, `]` etc. not being available, an example: you are writing software for a digital thermometer with an LCD display, and you write and compile it on your PC, which has those characters. That doesn't mean that the device this code will run on (the thermometer) will support displaying those characters. Point being: dev system != target system. – user2802841 Jan 26 '14 at 14:50
A char may be 9 bits, but it's still 1 "byte". Am I understanding it correctly? @JoachimPileborg – goodbyeera Jan 26 '14 at 14:54
@user2802841 But as required by the standard, the basic execution character set (the one on the target system in your example) must contain those characters "{}[]". – goodbyeera Jan 26 '14 at 14:59
@user3237645 The standard requires the dev system to have those; this does not apply to the target system. The target system might not even have any kind of character output, like a network controller or computer mouse. – user2802841 Jan 26 '14 at 15:15
@JoachimPileborg There are no constraints on the letters, but the standard requires that the digits 0-9 be contiguous and in that order (zero can't be after nine). – IronMensan Nov 16 '15 at 19:47

score 1 · Accepted Answer · answered Jan 26 '14 at 14:51

The word "byte" seems to be used sloppily in the first quote. As far as C++ is concerned, a byte is always a char, but the number of bits it holds is platform-dependent (and available in CHAR_BIT). Sometimes you want to say "a byte is eight bits", in which case you get a different meaning, and that may have been the intended meaning in the phrase "a char is 4 bytes".
The execution character set may very well be larger than or incompatible with the input character set provided by the environment. Trigraphs and alternate tokens exist to allow the representation of execution-set characters with fewer input characters on such restricted platforms (e.g. not is identical for all purposes to !, and the latter is not available in all character sets or keyboard layouts).

score 1 · Answer 2 · answered Jan 26 '14 at 15:32


It used to be the case that some national variants of ASCII, such as those for the Scandinavian languages, used accented alphabetic characters for the code points where US ASCII has punctuation such as [, ], {, }. These are the reason that C89 included trigraphs: they allow code to be written in the 'invariant subset' of ISO 646. See the chart of the characters used in the national variants on the Wikipedia page for ISO 646.

For example, someone in Scandinavia might have to read:

#include <stdio.h>

int main(int argc, char **argv)
æ
    for (int i = 1; i < argc; i++)
        printf("%sØn", argvÆiÅ);
    return 0;
å

instead of:

#include <stdio.h>

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%s\n", argv[i]);
    return 0;
}

Using trigraphs, you might write:

??=include <stdio.h>

int main(int argc, char **argv)
??<
    for (int i = 1; i < argc; i++)
        printf("%s??/n", argv??(i??));
    return 0;
??>

which is equally ghastly in any language.

I'm not sure how much of an issue this still is, but that's why the comments are there.

answered Jan 26 '14 at 15:32

Jonathan Leffler


Thank you very much for your answer. I'm still not very clear about this. What's the meaning of the standard's requirement of basic source character set and basic execution character set (which an object of type char is guaranteed to be able to hold) containing all these characters? – goodbyeera Jan 26 '14 at 16:18
I think it is a case of 'Theory, meet Practice; Practice, Theory'. The standard lays down requirements that are not always met in the real world. – Jonathan Leffler Jan 26 '14 at 16:29
In a footnote following the list of 91 characters of the basic source character set in the standard, it says "The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files." – goodbyeera Jan 27 '14 at 01:39
As for translation phase 1, the standard says: "Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations." – goodbyeera Jan 27 '14 at 01:45
Combining these two quotes and your example, is it safe to guess that the concern TC++PL expresses in my original post about {}[] not being available is mainly about the physical source file characters? When it comes to the basic source character set and the basic execution character set, those characters are always there. Is this correct? – goodbyeera Jan 27 '14 at 01:45
That seems like a reasonable interpretation. In one of the ISO 646 national variants, the mapping is an 'identity' mapping; code point 91 is displayed as `Æ` but is interpreted as `[` by the compiler. (The Unicode point for `Æ` is U+00C6.) – Jonathan Leffler Jan 27 '14 at 01:58
