
Let's assume only the C99 standard applies, and the printf library function needs to be implemented according to this standard to work with the UTF-16 encoding. Could you please clarify the expected behavior of the s conversion when a precision is specified?

C99 Standard (7.19.6.1) for s conversion says:

If no l length modifier is present, the argument shall be a pointer to the initial element of an array of character type. Characters from the array are written up to (but not including) the terminating null character. If the precision is specified, no more than that many bytes are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null character.

If an l length modifier is present, the argument shall be a pointer to the initial element of an array of wchar_t type. Wide characters from the array are converted to multibyte characters (each as if by a call to the wcrtomb function, with the conversion state described by an mbstate_t object initialized to zero before the first wide character is converted) up to and including a terminating null wide character. The resulting multibyte characters are written up to (but not including) the terminating null character (byte). If no precision is specified, the array shall contain a null wide character. If a precision is specified, no more than that many bytes are written (including shift sequences, if any), and the array shall contain a null wide character if, to equal the multibyte character sequence length given by the precision, the function would need to access a wide character one past the end of the array. In no case is a partial multibyte character written.

I don't quite understand this paragraph in general and the statement "If a precision is specified, no more than that many bytes are written" in particular.

For example, let's take the UTF-16 string "TEST" (byte sequence: 0x54, 0x00, 0x45, 0x00, 0x53, 0x00, 0x54, 0x00).

What is expected to be written to the output buffer in the following cases:

  • If precision is 3
  • If precision is 9 (one byte more than string length)
  • If precision is 12 (several bytes more than string length)

Then there's also "Wide characters from the array are converted to multibyte characters". Does that mean UTF-16 should be converted to UTF-8 first? This seems strange when I expect to work with UTF-16 only.

Alexander Zhak
  • `%s` takes a string. You can't store UTF-16 in a C string because of all the null bytes. (Also, UTF-16 is not a character set; it's an encoding.) – melpomene Sep 25 '16 at 13:02
  • `"\x54\x00\x45\x00\x53\x00\x54\x00"` is a string of length 1, containing `T`. – melpomene Sep 25 '16 at 13:03
  • Are you specifically asking about `%ls` / `wchar_t`? – melpomene Sep 25 '16 at 13:05
  • @melpomene, yes, I ask specifically about wide chars. Basically, it's about implementing printf to be used in UEFI applications. UEFI uses UCS-2 encoding for console input/output, so it would be obvious to stick to it. However, I'd like the function to be compatible with the standard, so that its output won't be a surprise for people who got used to standard C library – Alexander Zhak Sep 25 '16 at 13:10
  • OK, so I've looked at the standard and it looks a bit hairy. In particular, (from an implementer's perspective) UTF-16 seems unsuitable as a "native" representation (i.e. directly supported by C). It can't be used for "wide characters" (`wchar_t`) because a single character may take 2 UTF-16 units (surrogates). It can't be used as a "multibyte character set" because those must represent "basic characters" using one byte only. – melpomene Sep 25 '16 at 13:38
  • Do you control the compiler? How big is its `wchar_t` and how does it translate wide literals (`L"..."`, `L'.'`)? – melpomene Sep 25 '16 at 13:39
  • `wchar_t` is 16-bit – Alexander Zhak Sep 25 '16 at 13:52
  • 1
    What is the value of `CHAR_BIT` in your implementation? If `CHAR_BIT == 8`, you can't handle UTF-16 with `%s`; you'd use `%ls` and you'd pass a `wchar_t *` as the corresponding argument. You'd then have to read the second paragraph of the specification. If `CHAR_BIT == 16`, then you can't have an odd number of octets in the data. You then need to know about how `wchar_t` relates to `char` (are they the same size? do they have the same signedness?) and interpret both paragraphs to come up with a uniform effect — unless you decide to have `wchar_t` represent UTF-32. – Jonathan Leffler Sep 25 '16 at 21:31
  • 1
    @melpomene of course you can, you just need 16-bit characters. – n. m. could be an AI Sep 25 '16 at 21:45
  • @JonathanLeffler, thanks! Your comment makes it all clear. Byte is not necessarily 8 bits as defined in the standard: 3.6. byte - addressable unit of data storage large enough to hold any member of the basic character set of the execution environment. – Alexander Zhak Sep 25 '16 at 23:39

2 Answers


Converting a comment into a slightly expanded answer.

What is the value of CHAR_BIT in your implementation?

  • If CHAR_BIT == 8, you can't handle UTF-16 with %s; you'd use %ls and you'd pass a wchar_t * as the corresponding argument. You'd then have to read the second paragraph of the specification.

  • If CHAR_BIT == 16, then you can't have an odd number of octets in the data. You then need to know about how wchar_t relates to char (are they the same size? do they have the same signedness?) and interpret both paragraphs to come up with a uniform effect — unless you decided to have wchar_t represent UTF-32.

The key point is that UTF-16 cannot be handled as a C string if CHAR_BIT == 8 because there are too many useful characters that are encoded with one byte holding zero, but those zero bytes mark the end of a null-terminated string. To handle UTF-16, either the plain char type has to be a 16-bit (or larger) type (so CHAR_BIT > 8), or you have to use wchar_t (and sizeof(wchar_t) > sizeof(char)).

Note that the specification expects that wide characters will be converted to a suitable multibyte representation.

If you want wide characters output natively, you have to use fwprintf() and the related functions from <wchar.h>, first defined in C99. The specification there has a lot in common with the specification of fprintf(), but there are (unsurprisingly) important differences.

7.29.2.1 The fwprintf function

s
If no l length modifier is present, the argument shall be a pointer to the initial element of a character array containing a multibyte character sequence beginning in the initial shift state. Characters from the array are converted as if by repeated calls to the mbrtowc function, with the conversion state described by an mbstate_t object initialized to zero before the first multibyte character is converted, and written up to (but not including) the terminating null wide character. If the precision is specified, no more than that many wide characters are written. If the precision is not specified or is greater than the size of the converted array, the converted array shall contain a null wide character.

If an l length modifier is present, the argument shall be a pointer to the initial element of an array of wchar_t type. Wide characters from the array are written up to (but not including) a terminating null wide character. If the precision is specified, no more than that many wide characters are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null wide character.

Jonathan Leffler

wchar_t is not meant to be used for UTF-16, only for implementation-defined fixed-width encodings depending on the current locale. There's simply no sane way to support a variable-length encoding with the wide character API. Likewise, the multi-byte representation used by functions like printf or wcrtomb is implementation-defined. If you want to write portable code using Unicode, you can't rely on the wide character API. Use a library or roll your own code.

To answer your question: fprintf with the l modifier accepts a wide character string in the implementation-defined encoding specified by the current locale. If wchar_t is 16 bits, this encoding might be a bastardization of UTF-16, but as I mentioned above, there's no way to properly support UTF-16 surrogates. This wchar_t string is then converted to a multi-byte char string in an implementation-defined encoding. This might or might not be UTF-8. The specified precision limits the number of chars in the output string with the added restriction that no partial multi-byte characters are written.

Here's an example. Let's assume that the wide character encoding is UTF-32 with 32-bit wchar_t and that the multi-byte encoding is UTF-8 (like on Linux with an appropriate locale). The following code

wchar_t w[] = { 0x1F600, 0 }; // U+1F600 GRINNING FACE
printf("%.3ls", w);

will print nothing at all, since the resulting UTF-8 sequence is four bytes long and no partial multibyte character may be written. Only with a precision of at least four

printf("%.4ls", w);

will the character be printed.

EDIT: To answer your second question, no, printf should never write a null character. The sentence only means that in certain cases, a null character is required to specify the end of the string and avoid buffer over-reads.

nwellnhof