printing the char value of each wide character's bytes

Question

When running the following:

char acute_accent[7] = "éclair";
int i;
for (i=0; i<7; ++i)
{
    printf("acute_accent[%d]: %c\n", i, acute_accent[i]);
}

I get:

acute_accent[0]: 
acute_accent[1]: �
acute_accent[2]: c
acute_accent[3]: l
acute_accent[4]: a
acute_accent[5]: i
acute_accent[6]: r

which makes me think that the multibyte character é is 2 bytes wide.

However, when running this (after ignoring the compiler warning about a multi-character character constant):

printf("size: %lu",sizeof('é'));

I get size: 4.

What's the reason for the different sizes?

EDIT: This question differs from this one because it is more about multibyte character encoding, the different UTFs and their sizes, than about merely understanding the size of a char.

Constants in `'` quotes are of type `int`. Don't ignore warnings. — Eugene Sh., Feb 11 '16 at 14:46
What platform? On Windows you can use UCS-2 `wchar_t` but you still risk buffer overflow from composite codepoints and surrogate pairs. You should also specify the encoding of your strings otherwise the implementation is undefined: `u8"éclair";`. Possible duplicate: http://stackoverflow.com/questions/2172943/ — Zhro, Feb 11 '16 at 14:48
Possible duplicate of [Size of character ('a') in C/C++](http://stackoverflow.com/questions/2172943/size-of-character-a-in-c-c) — Zhro, Feb 11 '16 at 14:52
"after ignoring the compiler warning" - That is already enough. If you don't fully understand why the compiler warns, you should not ignore it. — too honest for this site, Feb 11 '16 at 14:54

Zhro · Accepted Answer · 2016-02-11T15:47:32.030

You're seeing a discrepancy because, in your first example, the compiler encoded the character é (code point U+00E9) as the two-byte UTF-8 sequence 0xC3 0xA9.

See here:

http://www.fileformat.info/info/unicode/char/e9/index.htm
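
To see those two bytes directly, here is a minimal sketch that dumps every byte of a string as hex. The helper name bytes_to_hex is made up for this illustration, and the é is written as explicit byte escapes so that the result does not depend on how the source file is saved:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes each byte of s as two uppercase hex digits,
   separated by single spaces, into out.
   out must have room for at least 3 * strlen(s) + 1 bytes. */
void bytes_to_hex(const char *s, char *out)
{
    size_t n = strlen(s);
    char *p = out;

    *p = '\0';
    for (size_t i = 0; i < n; ++i) {
        /* Go through unsigned char so bytes >= 0x80 are not sign-extended. */
        p += sprintf(p, "%s%02X", i ? " " : "", (unsigned)(unsigned char)s[i]);
    }
}
```

Calling bytes_to_hex("\xC3\xA9" "clair", buf) with a large enough buf fills it with "C3 A9 63 6C 61 69 72": two bytes for é, then one byte for each of the five ASCII letters. The literal is split in two because "\xA9c" would otherwise be read as a single, longer hex escape.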

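As dbush describes, the character constant 'é' has type int, so sizeof('é') reports sizeof(int), which is four bytes on your machine, no matter how many bytes é occupies inside a string. The multi-character warning shows that the compiler still saw two bytes between the quotes; it simply stored them in an int.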

Part of your confusion stems from relying on implementation-defined behavior: without a prefix, the encoding of a string literal is chosen by the compiler.

To avoid depending on that, clearly identify the encoding of your string literals.

For example:

char acute_accent[7] = u8"éclair";

This is bad form because, unless you count the bytes yourself, you can't know the exact length of the string. Indeed, my compiler (g++) complains: the string is 7 bytes long, but 8 bytes in total with the null character at the end, so it does not fit in the array. In C the declaration is accepted, but the array is then not null-terminated, so passing it to any function that expects a string would read past the end of the buffer.

It's much safer to use this instead:

const char* acute_accent = u8"éclair";
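
If you still want an array rather than a pointer, one option is to leave the size out and let the compiler count the bytes. This is only a sketch: the accessor acute_accent_size is a made-up name, and the é is spelled as byte escapes so the example does not depend on the source file's encoding:

```c
#include <stddef.h>

/* No explicit size: the compiler reserves room for every byte of the
   literal plus the terminating null character. */
static const char acute_accent[] = "\xC3\xA9" "clair";

/* Hypothetical accessor: returns the array size in bytes, terminator included. */
size_t acute_accent_size(void)
{
    return sizeof acute_accent;
}
```

Here sizeof acute_accent is 8, so the terminator is always there and nothing has to be counted by hand.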

Notice how your string is actually 8 bytes:

#include <stdio.h>
#include <string.h> // strlen

int main(void) {
    const char* a = u8"éclair";

    // strlen returns a size_t, which is printed with %zu
    printf("String length : %zu\n", strlen(a));

    // Add +1 for the null byte
    printf("String size   : %zu\n", strlen(a) + 1);

    return 0;
}

The output is:

String length : 7
String size   : 8
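
Note that strlen counts bytes, not characters: "éclair" has 6 characters but 7 bytes. If you need the number of code points instead, a rough sketch is to skip the UTF-8 continuation bytes. The function name utf8_code_points is invented for this example, and it assumes the input is valid UTF-8:

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a null-terminated string,
   assuming the string is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; ++s) {
        /* Continuation bytes look like 10xxxxxx; every other byte
           starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For "\xC3\xA9" "clair" this returns 6, because the 0xA9 byte is a continuation byte and is not counted. It counts code points only; a character built from several code points (for example e followed by a combining accent) would still count more than once.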

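Also note that the size of a character constant differs between C and C++. In C a character constant such as 'a' has type int; in C++ it has type char, while a multi-character constant (which is what 'é' becomes in a UTF-8 source file) still has type int.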

#include <stdio.h>

int main(void) {
    // sizeof yields a size_t, which is printed with %zu
    printf("%zu\n", sizeof('a'));

    printf("%zu\n", sizeof('é'));

    return 0;
}

In C the output is:

4
4

While in C++ the output is:

1
4

dbush · Answer 2 · 2016-02-11T15:30:41.970

From the C99 standard, section 6.4.4.4:

2 An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'.

...

10 An integer character constant has type int.

sizeof(int) on your machine is probably 4, which is why you're getting that result.

So 'é', 'c', 'l' are all integer character constants, so all are of type int whose size is 4. The fact that some are multibyte and some are not doesn't matter in this regard.
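
A small sketch of the same point, meant to be compiled as C (in C++ the first function would return 1). The two function names are made up for this illustration:

```c
#include <stddef.h>

/* In C, 'c' is an integer character constant of type int,
   so this is sizeof(int). */
size_t char_constant_size(void)
{
    return sizeof('c');
}

/* An object of type char is a different matter:
   sizeof(char) is 1 by definition. */
size_t char_object_size(void)
{
    return sizeof(char);
}
```

So char_constant_size() equals sizeof(int) on any C implementation, whatever character sits between the quotes, while char_object_size() is always 1. That is why sizeof on a character constant says nothing about how many bytes the character takes inside a string.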

edited Feb 11 '16 at 15:30

answered Feb 11 '16 at 14:52

But what is "An _integer_ character constant"? – Paul Ogilvie Feb 11 '16 at 14:54
@PaulOgilvie Added more detail. – dbush Feb 11 '16 at 14:57
@dbush, thank you, and a follow-up question: the `é` is clearly represented over 2 bytes on my machine (ubuntu 14.04, 64-bit). But wasn't it supposed to be 4 bytes? – OfirD Feb 11 '16 at 15:09
