
When running the following:

char acute_accent[7] = "éclair";
int i;
for (i=0; i<7; ++i)
{
    printf("acute_accent[%d]: %c\n", i, acute_accent[i]);
}

I get:

acute_accent[0]: 
acute_accent[1]: �
acute_accent[2]: c
acute_accent[3]: l
acute_accent[4]: a
acute_accent[5]: i
acute_accent[6]: r

which makes me think that the multibyte character é is 2 bytes wide.

However, when running this (after ignoring the compiler warning about a multi-character character constant):

printf("size: %zu", sizeof('é'));

I get size: 4.

What's the reason for the different sizes?

EDIT: This question differs from this one because it is more about multibyte characters encoding, the different UTFs and their sizes, than the mere understanding of a size of a char.

OfirD
  • Constants in `'` quotes are of type `int`. Don't ignore warnings. – Eugene Sh. Feb 11 '16 at 14:46
  • What platform? On Windows you can use UCS-2 `wchar_t` but you still risk buffer overflow from composite codepoints and surrogate pairs. You should also specify the encoding of your strings otherwise the implementation is undefined: `u8"éclair";`. Possible duplicate: http://stackoverflow.com/questions/2172943/ – Zhro Feb 11 '16 at 14:48
  • Possible duplicate of [Size of character ('a') in C/C++](http://stackoverflow.com/questions/2172943/size-of-character-a-in-c-c) – Zhro Feb 11 '16 at 14:52
  • "after ignoring the compiler warning" - That is already enough. If you don't fully understand why the compiler warns, you should not ignore it. – too honest for this site Feb 11 '16 at 14:54
  • @EugeneSh. and Olaf, lesson learned, thank you. – OfirD Feb 11 '16 at 14:58

2 Answers


The reason you're seeing a discrepancy is that in your first example, the compiler encoded the character é (code point U+00E9) as the two-byte UTF-8 sequence 0xC3 0xA9.

See here:

http://www.fileformat.info/info/unicode/char/e9/index.htm
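
To see those two bytes directly, the string can be dumped in hex. This is a minimal sketch, not part of the original answer: `dump_bytes` is a hypothetical helper, and the bytes of é are written as hex escapes so the result does not depend on how the source file is encoded.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes each byte of s as two hex digits, separated by
   spaces, into out (capacity out_size). Returns the number of bytes in s,
   not counting the terminating null. */
size_t dump_bytes(const char *s, char *out, size_t out_size)
{
    size_t n = strlen(s);
    size_t used = 0;

    if (out_size > 0) {
        out[0] = '\0';
    }
    /* Each step needs at most 3 characters plus the terminator. */
    for (size_t i = 0; i < n && used + 4 <= out_size; ++i) {
        used += (size_t)snprintf(out + used, out_size - used, "%s%02X",
                                 i ? " " : "", (unsigned)(unsigned char)s[i]);
    }
    return n;
}
```

Calling `dump_bytes("\xC3\xA9" "clair", buf, sizeof buf)` with a 64-byte `buf` returns 7 and leaves `C3 A9 63 6C 61 69 72` in `buf`: two bytes for é, then one byte per ASCII letter.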

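And as described by dbush, the character constant 'é' has type int, so sizeof reports sizeof(int), which is four bytes on your machine. The multi-character warning suggests the compiler packed the two UTF-8 bytes into that int as a multi-character constant, whose value is implementation-defined.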

Part of your confusion stems from relying on implementation-defined behavior: how a non-ASCII character in a plain literal is encoded depends on the compiler's execution character set.

To avoid depending on that, you should state the encoding of string literals explicitly.

For example:

char acute_accent[7] = u8"éclair";

This is bad form because, unless you count the bytes yourself, you can't know the exact length of the string. Indeed, my compiler (g++) rejects it: the string is 7 bytes, or 8 with the null character at the end, so it doesn't fit in a 7-element array. C accepts the declaration but silently drops the terminator, leaving an array that is not a valid C string.

It's much safer to use this instead:

const char* acute_accent = u8"éclair";
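
If an actual array is needed rather than a pointer, one option is to leave the bound out so the compiler counts the bytes. This is a small sketch, not part of the original answer: the names `eclair` and `eclair_array_size` are made up for illustration, and é is spelled with hex escapes to keep the byte count independent of the source encoding.

```c
#include <stddef.h>

/* Bound omitted: the compiler sizes the array to hold every byte of the
   literal plus the terminating null character. */
static const char eclair[] = "\xC3\xA9" "clair";

/* Returns the array size in bytes, terminator included. */
size_t eclair_array_size(void)
{
    return sizeof eclair;
}
```

Here `eclair_array_size()` returns 8, so there is no hand-counted length to get wrong.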

Notice how your string is actually 8 bytes:

#include <stdio.h>
#include <string.h> // strlen

int main(void) {
    const char* a = u8"éclair";

    // strlen returns size_t, which is printed with %zu
    printf("String length : %zu\n", strlen(a));

    // Add +1 for the null byte
    printf("String size   : %zu\n", strlen(a) + 1);

    return 0;
}

The output is:

String length : 7
String size   : 8
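
Note that strlen counts bytes, not characters: "éclair" has 6 characters but 7 bytes. As a rough sketch (not from the original answer, and assuming the input is valid UTF-8), code points can be counted by skipping continuation bytes, which always have the bit pattern 10xxxxxx. The name `utf8_codepoints` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a null-terminated string,
   assuming it is valid UTF-8. Every byte that is NOT a continuation byte
   (10xxxxxx) starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the bytes of "éclair" this returns 6, while strlen on the same bytes returns 7. It counts code points only; combining sequences would still be counted as several.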

Also note that the size of a character constant such as 'a' differs between C and C++: it has type int in C and type char in C++.

#include <stdio.h>

int main(void) {
    printf("%zu\n", sizeof('a'));

    printf("%zu\n", sizeof('é'));

    return 0;
}

In C the output is:

4
4

While in C++ the output is:

1
4
Zhro

From the C99 standard, section 6.4.4.4:

2 An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'.

...

10 An integer character constant has type int.

sizeof(int) on your machine is probably 4, which is why you're getting that result.

So 'é', 'c' and 'l' are all integer character constants, and each has type int, whose size is 4 here. Whether a constant is multibyte or not makes no difference to sizeof.
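
The distinction between a character constant and a char object can be checked directly. This is a short sketch with made-up function names, and it assumes the file is compiled as C (in C++ the first function would return 1).

```c
#include <stddef.h>

/* In C, 'a' is an integer character constant, so this is sizeof(int). */
size_t char_constant_size(void)
{
    return sizeof('a');
}

/* An object of type char is always exactly one byte. */
size_t char_object_size(void)
{
    char c = 'a';
    return sizeof c;
}
```

Under that assumption, `char_constant_size()` equals `sizeof(int)` and `char_object_size()` is 1, which is why the array elements in the question are one byte each while sizeof('é') reports 4.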

dbush