
In C11, support for the portable wide character types char16_t and char32_t was added, for UTF-16 and UTF-32 respectively.

However, in the technical report, there is no mention of endianness for these two types.

For example, compiling the following snippet with gcc 4.8.4 on my x86_64 machine with -std=c11:

#include <stdio.h>
#include <uchar.h>

int main(void)
{
    char16_t utf16_str[] = u"十六";  // U+5341 U+516D
    unsigned char *chars = (unsigned char *) utf16_str;
    printf("Bytes: %X %X %X %X\n", chars[0], chars[1], chars[2], chars[3]);
    return 0;
}

will produce

Bytes: 41 53 6D 51

This means that it's little-endian.

But is this behaviour platform/implementation dependent: does it always adhere to the platform's endianness or may some implementation choose to always implement char16_t and char32_t in big-endian?

Ryan Li
  • Suggest adding "why" endianness is important for the code. How does code plan to use the endianness? To make some conversion easier using type spoofing? IMO, endianness should play a minuscule role in robust portable code. – chux - Reinstate Monica Jul 15 '15 at 15:30

3 Answers


char16_t and char32_t do not guarantee Unicode encoding. (That is a C++ feature.) The macros __STDC_UTF_16__ and __STDC_UTF_32__, respectively, indicate that Unicode code points actually determine the fixed-size character values. See C11 §6.10.8.2 for these macros.
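
For example, a translation unit can insist on that guarantee at compile time. This is only a sketch of how the feature-test macros might be used, not anything from the original answer:

/* Refuse to build unless char16_t/char32_t string literals are
   guaranteed to hold UTF-16/UTF-32 code units on this implementation. */
#ifndef __STDC_UTF_16__
#error "char16_t string literals are not guaranteed to be UTF-16 here"
#endif

#ifndef __STDC_UTF_32__
#error "char32_t string literals are not guaranteed to be UTF-32 here"
#endif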

(By the way, __STDC_ISO_10646__ indicates the same thing for wchar_t, and it also reveals which Unicode edition is implemented via wchar_t. Of course, in practice, the compiler simply copies code points from the source file to strings in the object file, so it doesn't need to know much about particular characters.)

Given that Unicode encoding is in effect, code point values stored in char16_t or char32_t must have the same object representation as uint_least16_t and uint_least32_t, because they are defined to be typedef aliases to those types, respectively (C11 §7.28). This is again somewhat in contrast to C++, which makes those types distinct but explicitly requires compatible object representation.
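
As a small illustration (my own sketch, not from the standard text), the typedef identity can be checked with _Generic; the program below only compiles if char16_t really is the same type as uint_least16_t:

#include <stdio.h>
#include <stdint.h>
#include <uchar.h>

int main(void)
{
    /* _Generic selects by type; since C11 makes char16_t a typedef of
       uint_least16_t, the single association below matches. There is no
       default case, so the program would not compile otherwise. */
    puts(_Generic((char16_t)0,
                  uint_least16_t: "char16_t is uint_least16_t"));
    return 0;
}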

The upshot is that yes, there is nothing special about char16_t and char32_t. They are ordinary integers in the platform's endianness.

However, your test program as originally posted had nothing to do with endianness. It simply used the values of the wide characters without inspecting how they map to bytes in memory.
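
If you do want to look at the object representation, a sketch like the following (my addition, not code from the answer) inspects the bytes with memcpy; the order you see is simply the platform's integer byte order:

#include <stdio.h>
#include <string.h>
#include <uchar.h>

int main(void)
{
    char16_t c = 0x5341;                  /* one UTF-16 code unit, U+5341 */
    unsigned char bytes[sizeof c];

    memcpy(bytes, &c, sizeof c);          /* copy the object representation */
    for (size_t i = 0; i < sizeof c; i++)
        printf("%02X ", (unsigned) bytes[i]);
    putchar('\n');   /* typically "41 53" on little-endian, "53 41" on big-endian */
    return 0;
}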

Potatoswatter

However, in the technical report, there is no mention of endianness for these two types.

Indeed. The C standard doesn't specify much regarding the representation of multibyte characters in source files.

char16_t utf16_str[] = u"十六"; // U+5341 U+516D
printf("U+%X U+%X\n", utf16_str[0], utf16_str[1]);

will produce U+5341 U+516D, which means that it's little-endian.

But is this behaviour platform/implementation dependent: does it always adhere to the platform's endianness or may some implementation choose to always implement char16_t and char32_t in big-endian?

Yes, the behaviour is implementation-dependent, as you call it. See C11 §5.1.1.2:

Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.

That is, whether the multibyte characters in your source code are considered big endian or little endian is implementation-defined. I would advise using something like u"\u5341\u516d", if portability is an issue.
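
For instance, a version of the snippet written with universal character names (my restatement of the advice above, using the same two characters) does not depend on how the source file itself is encoded:

#include <stdio.h>
#include <uchar.h>

int main(void)
{
    /* \u escapes name the code points directly, so the source file can be
       plain ASCII and no source-encoding question arises. */
    char16_t utf16_str[] = u"\u5341\u516D";   /* same characters as u"十六" */
    printf("U+%X U+%X\n", (unsigned) utf16_str[0], (unsigned) utf16_str[1]);
    return 0;
}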

autistic
  • I updated the snippet to print out `unsigned char`s directly to make my intentions clear. If `char16_t` were encoded as big-endian, printf("U+%X\n", utf16_str[0]) would have resulted in `U+4153` because `unsigned int` is little-endian on x86_64. – Ryan Li Jul 15 '15 at 15:13
  • That makes your intentions no more clear. An implementation may use a little endian source character set and a big endian execution character set. Consider cross-compiling, for example. – autistic Jul 15 '15 at 15:19
  • Nonetheless, I've updated my answer because I realised what you were actually asking about... See the update. – autistic Jul 15 '15 at 15:20
  • But why would the source character set matter here? No matter if I save the snippet in UTF-16LE or UTF-16BE, the first character in `utf16_str` would always have the code point of 0x5341 in UTF-16. – Ryan Li Jul 15 '15 at 15:26
  • Translation phase 1 (which I quoted) says multibyte characters are mapped in an implementation-defined manner. Why? It is because it is. – autistic Jul 15 '15 at 15:29

UTF-16 and UTF-32 do not have a defined endianness. They are usually encoded in the host's native byte ordering. This is why there are byte order marks (BOM), which can be inserted at the beginning of a string to indicate the endianness of a UTF-16 or UTF-32 string.
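
As an illustration of the idea, a sketch along these lines (my example, not from the answer) could detect a UTF-16 BOM and report the byte order:

#include <stdio.h>

/* Returns 1 for a little-endian BOM (FF FE), 0 for a big-endian BOM (FE FF),
   and -1 if the buffer does not start with a BOM at all. */
static int utf16_bom_endianness(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) return 1;   /* U+FEFF stored LE */
        if (buf[0] == 0xFE && buf[1] == 0xFF) return 0;   /* U+FEFF stored BE */
    }
    return -1;
}

int main(void)
{
    const unsigned char data[] = { 0xFF, 0xFE, 0x41, 0x53 };  /* BOM + U+5341 */
    printf("little-endian? %d\n", utf16_bom_endianness(data, sizeof data));
    return 0;
}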

  • What is the good practice when this data comes from a file with a BOM: swap the bytes, or leave them untouched and interpret them while reading? – Sandburg May 31 '21 at 09:21