Consider the following statement -

cout It displays an integral sign (a Unicode character) when compiled with my g++ 4.8.2.

1). Does it mean the basic character set of this implementation is also Unicode?

If yes, then consider the following statement -

C++ defines 'byte' differently. A C++ byte consists of enough bits to accommodate at least the total number of characters in the basic character set of the implementation.

2). If my compiler supports Unicode, then the number of bits in a byte, according to the above definition of 'byte', must be greater than 8. Hence CHAR_BIT > 8 here, right? But my compiler shows CHAR_BIT == 8. Why?

Reference : C++ Primer Plus

P.S. I'm a beginner. Don't throw me into the complex technical details. Keep it simple and straight. Thanks in advance!

3 Answers

2

Unicode has nothing to do with your compiler or C++ defining "byte" differently. It's simply about separating the concept of "byte" and "character" at the string level and the string level alone.

The only time Unicode's multi-byte characters come into play is during display and when manipulating the strings. See also the difference between std::wstring and std::string for a more technical explanation.

The compiler just compiles. It doesn't care about your character set except when it comes to dealing with the source-code.

Bytes are, as always, 8 bits only.

tadman
  • *Bytes are, as always, 8 bits only.* That is incorrect. There is no definitive standard that says a byte is 8 bits, and there are machines out there that do use larger words. http://stackoverflow.com/questions/5516044/system-where-1-byte-8-bit – NathanOliver Nov 09 '15 at 17:36
  • Historically speaking, yes, and in some really odd cases this is true as well, but rare is the programmer that *ever* has to deal with that. Also, pretty sure any system that has non-8-bit bytes doesn't support Unicode in any fashion. There are exceptions to any rule, obviously, but the general principle here is bytes are 8 bits and 8 bits only. The definitions for `short` and `long` are much more subjective. – tadman Nov 09 '15 at 17:45
1

Does it mean the basic character set of this implementation is also Unicode?

No, there is no such requirement, and there are very few implementations where char is large enough to hold arbitrary Unicode characters.

char is large enough to hold members of the basic character set, but what happens with characters that aren't in the basic character set depends.

On some systems, everything might be converted to a single character set such as ISO 8859-1, which has fewer than 256 characters and so fits entirely in char.

On other systems, everything might be encoded as UTF-8, meaning a single logical character potentially takes up several char values.

0

Many compilers support UTF-8, with the basic character set being ASCII. In UTF-8, a Unicode code point consists of 1 to 4 bytes, so typically 1 to 4 chars. UTF-8 is designed so that most of C and C++ works just fine with it without having any direct support. Just be aware that, for example, strlen() returns the number of bytes, not the number of code points. But most of the time you don't really care about that. (Functions like strncpy, which are dangerous anyway, become just slightly more dangerous with UTF-8.)

And of course forget about using char to store a Unicode code point. But then once you get into a bit more sophisticated string handling, many, many things cannot be done on a character level anyway.

gnasher729