Certain GNU-based distributions (Debian, for example) are still affected by a bug in GNU libc that causes the printf family of functions to return a bogus -1 when the specified precision would truncate a multi-byte character. This bug was fixed in glibc 2.17 and backported to 2.16. Debian has an archived bug for this, but the maintainers appear to have no intention of backporting the fix to the glibc 2.13 used by Wheezy.
The text below is quoted from https://sourceware.org/bugzilla/show_bug.cgi?id=6530. (Please do not edit the block quoting inline again.)
Here's a simpler testcase for this bug courtesy of Jonathan Nieder:
#include <stdio.h>
#include <locale.h>

int main(void)
{
    int n;

    setlocale(LC_CTYPE, "");
    n = printf("%.11s\n", "Author: \277");
    perror("printf");
    fprintf(stderr, "return value: %d\n", n);
    return 0;
}
Under a C locale that'll do the right thing:
$ LANG=C ./test
Author: �
printf: Success
return value: 10
But not under a UTF-8 locale, since \277 isn't a valid UTF-8 sequence:
$ LANG=en_US.utf8 ./test
printf: Invalid or incomplete multibyte or wide character
It's worth noting that printf will also overwrite the first character of the output array with \0 in this context.
I am currently trying to retrofit a MUD codebase to support UTF-8, and unfortunately the code is riddled with cases where an arbitrary sprintf precision is used to limit how much text is sent to output buffers. This problem is made much worse by the fact that most programmers don't expect a -1 return in this context, which can result in uninitialized memory reads and badness that cascades down from that. (I have already caught a few cases in valgrind.)
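To illustrate the failure mode: code that adds the return value to a write offset goes wrong as soon as that value is -1. Below is a minimal sketch of a guard, assuming the call sites can be pointed at a wrapper. The name checked_snprintf is hypothetical and not part of any library. It only contains the damage from a failed call; it does not make the formatting itself succeed.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a libc function): formats into buf and
 * returns the number of bytes actually stored, excluding the
 * terminating NUL. It never returns a negative value, so callers
 * that advance an offset by the result cannot step backwards.
 */
size_t checked_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    if (size == 0)
        return 0;

    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);

    if (n < 0) {
        /* Conversion failed (for example with EILSEQ, as in the
         * test case above): leave a well-defined empty string. */
        buf[0] = '\0';
        return 0;
    }
    if ((size_t)n >= size) {
        /* Output was truncated to fit; size - 1 bytes were stored. */
        return size - 1;
    }
    return (size_t)n;
}
```

With this in place a failed conversion yields an empty string and a length of 0 instead of a negative offset, which at least keeps the uninitialized reads out of the picture.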
Has anyone come up with a concise workaround for this bug in their code that doesn't involve rewriting every single invocation of a formatting string with arbitrary length precision? I'm fine with truncated UTF-8 characters being written to my output buffer as it's fairly trivial to clean that up in my output processing prior to socket write, and it seems like overkill to invest this much effort in a problem that will eventually go away given a few more years.
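One direction that might fit these constraints, given that the test case above behaves correctly under the C locale, is to run the formatting call itself under a temporary "C" locale, so that the precision is counted in plain bytes. The sketch below uses the POSIX.1-2008 per-thread locale functions newlocale() and uselocale(). The wrapper names c_locale_vsnprintf and c_locale_snprintf are hypothetical, and I have not verified this against glibc 2.13 specifically. Note that every locale category is "C" for the duration of the call, so locale-dependent output such as the decimal point is affected as well.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <locale.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical wrapper: calls vsnprintf() with the calling thread
 * temporarily switched to the "C" locale, then restores whatever
 * locale was in effect before.
 */
int c_locale_vsnprintf(char *buf, size_t size, const char *fmt, va_list ap)
{
    /* Created once and reused. The lazy initialization is not
     * thread-safe, which is assumed acceptable for a
     * single-threaded server. */
    static locale_t c_loc = (locale_t)0;
    locale_t old;
    int n;

    if (c_loc == (locale_t)0) {
        c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
        if (c_loc == (locale_t)0)
            return vsnprintf(buf, size, fmt, ap); /* fall back */
    }

    old = uselocale(c_loc);   /* switch this thread only */
    n = vsnprintf(buf, size, fmt, ap);
    uselocale(old);           /* restore the previous locale */
    return n;
}

/* Variadic front end with the same contract as snprintf(). */
int c_locale_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = c_locale_vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return n;
}
```

The format strings would stay untouched. The existing calls would still need to be routed to the wrapper, for instance through a project-wide macro, and any truncated UTF-8 sequence left at the end of the buffer would be cleaned up in the output processing as described above.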