The reason you're seeing a discrepancy is that in your first example, the character é was encoded by the compiler as the two-byte UTF-8 sequence 0xC3 0xA9 (the encoding of code point U+00E9).
See here:
http://www.fileformat.info/info/unicode/char/e9/index.htm
And as described by dbush, the character constant 'é' was encoded as a UTF-32 code point and stored in an int; therefore it was represented as four bytes.
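You can see both representations directly. Here's a minimal sketch (assuming a C11 compiler and a UTF-8 source/execution character set) that dumps the bytes of the UTF-8 literal and the size of the character constant:

#include <stdio.h>

int main(void) {
    const char *s = u8"é"; /* UTF-8 string literal: 0xC3 0xA9 plus a null terminator */
    for (const char *p = s; *p != '\0'; ++p)
        printf("0x%02X ", (unsigned char)*p); /* prints: 0xC3 0xA9 */
    printf("\n");
    printf("%zu\n", sizeof('é')); /* multi-byte character constant has type int, typically 4 */
    return 0;
}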
Part of your confusion stems from relying on an implementation-defined feature: how Unicode characters are stored in plain literals is left to the compiler. To avoid those surprises, you should always identify the encoding of string literals explicitly with a prefix.
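As a rough sketch (assuming C11), each standard prefix pins down the encoding, independent of compiler settings:

#include <stdio.h>

int main(void) {
    printf("%zu\n", sizeof u8"é"); /* UTF-8:  char[3]     -> 3 (0xC3 0xA9 + null)  */
    printf("%zu\n", sizeof u"é");  /* UTF-16: char16_t[2] -> 4 (0x00E9 + null)     */
    printf("%zu\n", sizeof U"é");  /* UTF-32: char32_t[2] -> 8 (0x000000E9 + null) */
    return 0;
}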
For example:
char acute_accent[7] = u8"éclair";
This is bad form because you can't know the exact length of the string without counting the bytes yourself. And indeed, my compiler (g++) is yelling at me: while the string's characters take 7 bytes, it is 8 bytes total with the null terminator at the end, so the initializer overruns the buffer.
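If you do want an array rather than a pointer, one safer pattern (a sketch; the name acute_accent is just carried over from the example above) is to let the compiler compute the size:

#include <stdio.h>

int main(void) {
    char acute_accent[] = u8"éclair"; /* compiler sizes the array: 7 bytes + null terminator */
    printf("%zu\n", sizeof acute_accent); /* prints 8 */
    return 0;
}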
It's also much safer to simply use a pointer:
const char* acute_accent = u8"éclair";
Notice how your string is actually 8 bytes:
#include <stdio.h>
#include <string.h> // strlen

int main(void) {
    const char* a = u8"éclair";
    printf("String length : %zu\n", strlen(a));
    // Add +1 for the null byte
    printf("String size : %zu\n", strlen(a) + 1);
    return 0;
}
The output is:
String length : 7
String size : 8
Also note that the size of a character constant is different between C and C++!
#include <stdio.h>

int main(void) {
    printf("%zu\n", sizeof('a')); // type int in C, type char in C++
    printf("%zu\n", sizeof('é')); // doesn't fit in one byte: type int in both
    return 0;
}
In C the output is:
4
4
While in C++ the output is:
1
4
This is because a character constant like 'a' has type int in C but type char in C++; 'é' doesn't fit in a single byte, so both languages give it type int (4 bytes on this platform).