Does the g++ 4.8.2 compiler support Unicode characters?

Question

Consider the following statements -

cout<<"\U222B";

int a='A';
cout<<a;

The first statement prints an integration sign (the character equivalent to the Unicode code point) whereas the second cout statement prints the ASCII value 65.

So I want to ask two things -

1) If my compiler supports Unicode character set then why it is implementing the ASCII character set and showing the ascii values of the characters?

2) With reference to this question - what is the difference in defining the 'byte' in terms of computer memory and in terms of C++?

Does my compiler implement 16-bit or 32-bit byte? If yes, then why do the value of CHAR_BIT is set to 8?

what's important is the console's support for Unicode, as the program simply outputs a byte sequence in some charset that will be treated some way by the console to print out, and if the console's Unicode support is bad, you can't see the expected result — phuclv, Nov 09 '15 at 03:05

paxdiablo · Answer 1 · 2015-11-09T02:36:28.593

In answer to your first question, the bottom 128 code points of Unicode are ASCII. There's no real distinction between the two.

The reason you're seeing 65 is because the thing you're outputting (a) is an int rather than a char (it may have started as a char but, by putting it into a, you modified how it would be treated in future).

For your second question, a byte is a char, at least as far as the ISO C and C++ standards are concerned. If CHAR_BIT is defined as 8, that's how wide your char type is.

However, you should keep in mind the difference between Unicode code points and Unicode representations (such as UTF-8). Having CHAR_BIT == 8 will still allow Unicode to work if UTF-8 representation is used.

My advice would be to capture the output of you program with a hex dump utility, you may well find the Unicode character is coming out as e2 88 ab, which is the UTF-8 representation of U+222B. It will then be interpreted by something outside of the program (eg, the terminal program) to render the correct glyph(s):

#include <iostream>
using namespace std;
int main() { cout << "\u222B\n"; }

Running that program above shows what's being output:

pax> g++ -o testprog testprog.cpp ; ./testprog
∫

pax> ./testprog | hexdump
0000000 e2 88 ab 0a

You could confirm that by generating the same UTF-8 byte sequence in a different way:

pax> printf "\xe2\x88\xab\n"
∫

score 0 · Answer 2 · edited May 23 '17 at 12:31

0

There are several different questions/issues here:

As paxdiablo pointed out, you're seeing "65" because you're outputting "a" (value 'A' = ASCII 65) as an "int".
Yes, gcc supports Unicode source files: --finput-charset=OPTION
The final issue is whether the C++ compiler treats your "strings" as 8-bit ASCII or n-bit Unicode.

C++11 added explicit support for Unicode strings and string literals, encoded as UTF-8, UTF-16 big endian, UTF-16 little endian, UTF-32 big endian and UTF-32 little endian:

How well is Unicode supported in C++11?

PS:

As far as language support for Unicode:

Java was designed from the ground up for Unicode.

Unfortunately, at the time that meant only UTF-16. Java 5 supported nicode 6.0, Java 7 Unicode 6.0 and the current Java 8 supports Unicode 6.2.
.Net is newer. C#, VB.Net and C++/CLI all fully support Unicode 4.0.

Newer versions of .Net support newer versions of Unicode. For example, .Net 4.0 supports Unicode 5.1](What version of Unicode is supported by which .NET platform and on which version of Windows in regards to character classes?).
Python3 also supports Unicode 4.0: http://www.diveintopython3.net/strings.html

edited May 23 '17 at 12:31

Community

1
1

answered Nov 09 '15 at 02:31

paulsm4

114,292
17
138
190

I'm not sure what you mean by "Python3 supports Unicode 4.0". Python3 supports versions of Unicode *much* more recent than Unicode 4.0, to the extent that it is meaningful to talk about Python3 supporting a version of Unicode. As a datapoint, Python 3.5's `unicodedata` module corresponds to Unicode 8.0.0. – rici Nov 09 '15 at 03:02
The last time I looked (admittedly a long time ago), Python supported Unicode 4.0. SUGGESTION: If you have a link for current version/current Unicode support, please post it! – paulsm4 Nov 09 '15 at 05:04
As I said, I don't really know what you mean by "unicode x.y.z support", since Python does no text rendering. But it is easy to trace the `unicodedata` module, which defines, for example, which letters are valid in identifiers. See https://docs.python.org/3.5/library/unicodedata.html#module-unicodedata (and change the version number in that URL if necessary. – rici Nov 09 '15 at 05:33

ABu · Answer 3 · 2015-11-09T04:00:32.907

0

For of all, sorry for my English if it has mistakes.

A C++ byte is any defined amount of bits large enough to transport every character of a set specified by the standard. This required set of characters is a subset of ASCII, and that previously defined "amount of bits" must be the memory unit for chars, the tiniest memory atom of C++. Every other type must be a multiple of sizeof(char) (any C++ value is a bunch of chars continously stored on memory).

So, sizeof(char) must be 1 by definition, because is the memory measurement unit of C++. If that 1 means 1 physical byte or not is an implementation issue, but universally accepted as 1 byte.

What I don't understand is what do you mean with 16-bit or 32-bit byte.

Other related question is about the encoding your compiled applies to your source texts, literal strings included. A compiler, if I'm not wrong, normalizes each translation unit (source code file) to an encoding of its choice to handle the file.

I don't really know what happens under the hood, but perhaps you have read something somewhere about source file/internal encoding, and 16 bits/32bits encodings and all the mess is blended on your head. I'm still confused either though.

edited Nov 09 '15 at 04:00

answered Nov 09 '15 at 02:40

ABu

10,423
6
52
103

2

Unicode is absolutely a character set; it defines a set of characters, assigning each one an abstract code point. UTF-8 is an *encoding* of the Unicode character set, which maps each code point to a concrete sequence of bytes (possible of length 1). UTF-16 is a different encoding of the Unicode character set. CP-1252 is a one-byte encoding of a set of characters; the exact set of characters has no name, so it is sometimes confusingly also called CP-1252. (CP stands for "code page" so it clearly refers to the mapping.) – rici Nov 09 '15 at 02:59
`sizeof (char)` is 1 *byte* by definition, where a "byte" is 8 or more bits. (The C standard has its own definition of "byte". It's not necessarily the same as an octet, which is exactly 8 bits.) A C++ implementation *can* have 16-bit or 32-bit bytes, but encodings like UTF-8 make it possible to represent Unicode with just 8-bit bytes. – Keith Thompson Nov 09 '15 at 05:38
Are you sure the C++ standard says `sizeof(char)` must be at least 8 bits long? – ABu Nov 09 '15 at 16:54

Does the g++ 4.8.2 compiler support Unicode characters?

3 Answers3