Currently, the logic of perror in the glibc source is as follows:

If stderr is oriented, use it as is; otherwise dup() its underlying file descriptor and write the message to a new stream opened on the duplicate.

If stderr is wide-oriented, the following logic from stdio-common/fxprintf.c is used:

size_t len = strlen (fmt) + 1;
wchar_t wfmt[len];
for (size_t i = 0; i < len; ++i)
  {
    assert (isascii (fmt[i]));
    wfmt[i] = fmt[i];
  }
res = __vfwprintf (fp, wfmt, ap);

The format string is converted to wide-character form by the following code, which I do not understand:

wfmt[i] = fmt[i];

Also, it uses an isascii assertion:

assert (isascii(fmt[i]));

But the format string is not always ASCII in wide-character programs, because we may use a UTF-8 format string, which can contain non-7-bit values. Why is there no assertion failure when we run the following code (assuming a UTF-8 locale and UTF-8 compiler encoding)?

#include <stdio.h>
#include <errno.h>
#include <wchar.h>
#include <locale.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  fwide(stderr, 1);
  errno = EINVAL;
  perror("привет мир");  /* note that the string is multibyte */
  return 0;
}
$ ./a.out 
привет мир: Invalid argument

Can we use dup() on a wide-oriented stderr to get a stream that is not wide-oriented? In that case the code could be rewritten without this mysterious conversion, taking into account that perror() takes only multibyte strings (const char *s) and locale messages are all multibyte anyway.

It turns out we can. The following code demonstrates this:

#include <stdio.h>
#include <wchar.h>
#include <unistd.h>
int main(void)
{
  fwide(stdout,1);
  FILE *fp;
  int fd = -1;
  if ((fd = fileno (stdout)) == -1) return 1;
  if ((fd = dup (fd)) == -1) return 1;
  if ((fp = fdopen (fd, "w+")) == NULL) return 1;
  wprintf(L"stdout: %d, dup: %d\n", fwide(stdout, 0), fwide(fp, 0));
  return 0;
}
$ ./a.out 
stdout: 1, dup: 0

BTW, is it worth posting an issue about this improvement to glibc developers?


NOTE

Using dup() is limited with respect to buffering, and I wonder whether this is accounted for in the implementation of perror() in glibc. The following example demonstrates the issue: output appears not in the order of writing to the streams, but in the order in which the buffered data is flushed. Note that the order of the values in the output is not the same as in the program, because the output of fprintf is flushed first (thanks to the "\n" on a line-buffered stream), while the output of fwprintf is flushed only when the program exits.

#include <wchar.h>
#include <stdio.h>
#include <unistd.h>
int main(void)
{
  wint_t wc = L'b';
  fwprintf(stdout, L"%lc", wc);

  /* --- */

  FILE *fp;
  int fd = -1;
  if ((fd = fileno (stdout)) == -1) return 1;
  if ((fd = dup (fd)) == -1) return 1;
  if ((fp = fdopen (fd, "w+")) == NULL) return 1;

  char c = 'h';
  fprintf(fp, "%c\n", c);
  return 0;
}
$ ./a.out 
h
b

But if we use \n in fwprintf, the output order is the same as in the program:

$ ./a.out 
b
h

perror() manages to get away with that, because in GNU libc stderr is unbuffered. But will it work safely in programs where stderr is manually set to buffered mode?


This is the patch that I would propose to glibc developers:

diff -urN glibc-2.24.orig/stdio-common/perror.c glibc-2.24/stdio-common/perror.c
--- glibc-2.24.orig/stdio-common/perror.c   2016-08-02 09:01:36.000000000 +0700
+++ glibc-2.24/stdio-common/perror.c    2016-10-10 16:46:03.814756394 +0700
@@ -36,7 +36,7 @@

   errstring = __strerror_r (errnum, buf, sizeof buf);

-  (void) __fxprintf (fp, "%s%s%s\n", s, colon, errstring);
+  (void) _IO_fprintf (fp, "%s%s%s\n", s, colon, errstring);
 }


@@ -55,7 +55,7 @@
      of the stream.  What is supposed to happen when the stream isn't
      oriented yet?  In this case we'll create a new stream which is
      using the same underlying file descriptor.  */
-  if (__builtin_expect (_IO_fwide (stderr, 0) != 0, 1)
+  if (__builtin_expect (_IO_fwide (stderr, 0) < 0, 1)
       || (fd = __fileno (stderr)) == -1
       || (fd = __dup (fd)) == -1
       || (fp = fdopen (fd, "w+")) == NULL)
Igor Liferenko
  • Note that `wchar_t` is not relevant for strings in UTF-8, since it's a *wide* character, capable of representing a larger-than-`char` encoding space as single values. It seems to assume that the argument to `perror()` is all-ASCII, not sure why. – unwind Oct 10 '16 at 07:48
  • @unwind `wchar_t` itself uses *internal encoding* (UCS-4 in glibc), but there is also *compiler encoding* (UTF-8 in my example), which is used in multibyte strings constants. But the code in `fxprintf.c` somehow converts format string from UTF-8 to UCS-4 (to pass it to `__vfwprintf`). I completely don't understand how it succeeds. But most of all I wonder *why* it is done. – Igor Liferenko Oct 10 '16 at 07:57
  • Hm ... interesting, that distinction is very rarely used. And I agree, there's no way that `for` loop does any kind of conversion which is not "throw some bits away". – unwind Oct 10 '16 at 08:00

2 Answers


it uses isascii assert.

This is OK. You are not supposed to call this function. It is a glibc internal. Note the two underscores in front of the name. When called from perror, the argument in question is "%s%s%s\n", which is entirely ASCII.

But the format string is not always ascii in wide-character programs, because we may use UTF-8

First, UTF-8 has nothing to do with wide characters. Second, the format string is always ASCII because the function is only called by other glibc functions that know what they are doing.

perror("привет мир");

This is not the format string, this is one of the arguments that corresponds to one of the %s in the actual format string.

Can we use dup() on wide-oriented stderr

You cannot use dup on a FILE*, it operates on POSIX file descriptors that don't have orientation.

This is the patch that I would propose to glibc developers:

Why? What isn't working?

n. m. could be an AI

NOTE: It wasn't easy to find concrete questions in this post; on the whole, the post seems to be an attempt to engage in a discussion about implementation details of glibc, which it seems to me would be better directed to a forum specifically oriented to development of that library such as the libc-alpha mailing list. (Or see https://www.gnu.org/software/libc/development.html for other options.) This sort of discussion is not really a good match for StackOverflow, IMHO. Nonetheless, I tried to answer the questions I could find.

  1. How does wfmt[i] = fmt[i]; convert from multibyte to wide character?

    Actually, the code is:

    assert(isascii(fmt[i]));
    wfmt[i] = fmt[i];
    

    which relies on the fact that the numeric value of an ASCII character is the same when stored as a wchar_t. Strictly speaking, this need not be the case. The C standard specifies:

    Each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant if an implementation does not define __STDC_MB_MIGHT_NEQ_WC__. (§7.19/2)

    (gcc does not define that symbol.)

    However, that only applies to characters in the basic set, not to all characters recognized by isascii. The basic character set contains the 91 printable ascii characters as well as space, newline, horizontal tab, vertical tab and form feed. So it is theoretically possible that one of the remaining control characters will not be correctly converted. However, the actual format string used in the call to __fxprintf only contains characters from the basic character set, so in practice this pedantic detail is not important.

  2. Why is there no assert failure when we execute perror("привет мир");?

    Because only the format string is being converted, and the format string (which is "%s%s%s\n") contains only ascii characters. Since the format string contains %s (and not %ls), the argument is expected to be char* (and not wchar_t*) in both the narrow- and wide-character orientations.

  3. Can we use dup() on wide-oriented stderr to make it not wide-oriented?

    That would not be a good idea. First, if the stream has an orientation, it might also have a non-empty internal buffer. Since that buffer is part of the stdio library and not of the underlying Posix fd, it will not be shared with the duplicate fd. So the message printed by perror might be interpolated in the middle of some existing output. In addition, it is possible that the multibyte encoding has shift states, and that the output stream is not currently in the initial shift state. In that case, outputting an ascii sequence could result in garbled output.

    In the actual implementation, the dup is only performed on streams without orientation; these streams have never had any output directed at them, so they are definitely still in the initial shift state with an empty buffer (if the stream is buffered).

  4. Is it worth posting an issue about this improvement to glibc developers?

    That is up to you, but don't do it here. The normal way of doing that would be to file a bug. There is no reason to believe that glibc developers read SO questions, and even if they do, someone would have to copy the issue to a bug, and also copy any proposed patch.

rici
  • concerning 1. - I believe that §6.3.1.3 from C11 is applicable here, and *all* ascii codes must be correctly converted. See quote from C11 here http://stackoverflow.com/a/25080246/1487773 – Igor Liferenko Mar 13 '17 at 14:39
  • in particular, will this code work correctly? `wchar_t c=... if (c>' ' && c!=0177) { /* visible ASCII and all non-ASCII codes */ ...` – Igor Liferenko Mar 13 '17 at 14:53
  • @IgorLiferenko §6.3.1.3 deals with converting integers from one integer type to another, and there is no problem converting the integer 7 stored in a `char` to a 7 stored in a `wchar_t`. However, §6.3.1.3 does not guarantee anything more than that; in particular, it does not guarantee that 7 is the wchar_t code for a alert (`\a`). That was what my answer was about. §7.19/2 guarantees that all values between `' '` and `\176` have common semantics as `char` and `wchar_t`, but `wchar_t` could theoretically place non-ascii characters in codes less than 32 (other than 0 and whitespace). – rici Mar 13 '17 at 15:37
  • This is only possible if `wchar_t` uses arbitrary charset (not Unicode). But in such case values between `' '` and `\176` need not correspond either. Why this contradiction? UTF-8 is valid encoding for Unicode. The requirement of UTF-8 is that it matches ASCII in `\000-\177` range. According to the structure of UTF-8, it follows that all ascii codes are Unicode values, and vice versa. In other words, the following transformations are always valid: `wc = (wchar_t)c;` and `c = (char)wc;` where `c` is of type `char` and `wc` is of type `wchar_t`, and `c` contains ascii codes. Is it true? – Igor Liferenko Mar 27 '17 at 07:48
  • @igor: there is no guarantee that `char` values will be interpreted as Ascii, that multibyte characters are UTF-8, that wchar_t is any Unicode encoding, or that `char` and `wchar_t` are compatible. If you assume all of those things, a wider range of compatible code is possible, but that code is no loner portable. – rici Mar 27 '17 at 08:02