
Certain GNU-based OS distros (Debian) are still impacted by a bug in GNU libc that causes the printf family of functions to return a bogus -1 when the specified level of precision would truncate a multi-byte character. This bug was fixed in 2.17 and backported to 2.16. Debian has an archived bug for this, but the maintainers appear to have no intention of backporting the fix to the 2.13 used by Wheezy.

The text below is quoted from https://sourceware.org/bugzilla/show_bug.cgi?id=6530. (Please do not edit the block quoting inline again.)

Here's a simpler testcase for this bug courtesy of Jonathan Nieder:

#include <stdio.h>
#include <locale.h>

int main(void)
{
    int n;

    setlocale(LC_CTYPE, "");
    n = printf("%.11s\n", "Author: \277");
    perror("printf");
    fprintf(stderr, "return value: %d\n", n);
    return 0;
}

Under a C locale that'll do the right thing:

$ LANG=C ./test
Author: �
printf: Success
return value: 10

But not under a UTF-8 locale, since \277 isn't a valid UTF-8 sequence:

$ LANG=en_US.utf8 ./test
printf: Invalid or incomplete multibyte or wide character

It's worth noting that printf will also overwrite the first character of the output array with \0 in this context.

I am currently trying to retrofit a MUD codebase to support UTF-8, and unfortunately the code is riddled with cases where arbitrary sprintf precision is used to limit how much text is sent to output buffers. The problem is made much worse by the fact that most programmers don't expect a -1 return in this context, which can result in uninitialized memory reads and badness that cascades down from there. (I've already caught a few cases in Valgrind.)

Has anyone come up with a concise workaround for this bug in their code that doesn't involve rewriting every single invocation of a formatting string with arbitrary length precision? I'm fine with truncated UTF-8 characters being written to my output buffer as it's fairly trivial to clean that up in my output processing prior to socket write, and it seems like overkill to invest this much effort in a problem that will eventually go away given a few more years.
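To be concrete about the cleanup pass I have in mind, here is a simplified sketch (the helper name is my own, not from any library): it replaces stray, invalid, or incomplete UTF-8 sequences with '?' in place. It checks continuation bytes only, not overlong encodings or surrogates.

```c
#include <stddef.h>

/* Replace stray, invalid, or incomplete UTF-8 sequences with '?',
   in place.  Simplified: validates continuation bytes only; it does
   not reject overlong encodings or surrogate code points. */
void utf8_sanitize(char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = (unsigned char)s[i];
        size_t need, j;
        int ok = 1;

        if (b < 0x80)                need = 0;  /* ASCII byte */
        else if ((b & 0xE0) == 0xC0) need = 1;  /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0) need = 2;  /* 3-byte sequence */
        else if ((b & 0xF8) == 0xF0) need = 3;  /* 4-byte sequence */
        else { s[i++] = '?'; continue; }        /* invalid lead byte */

        /* All continuation bytes must be present and of form 10xxxxxx. */
        for (j = 1; j <= need; j++)
            if (i + j >= len || ((unsigned char)s[i + j] & 0xC0) != 0x80)
                ok = 0;

        if (ok)
            i += need + 1;
        else
            s[i++] = '?';   /* mangle the lead byte, rescan the rest */
    }
}
```

Running something like this over each output buffer just before the socket write would paper over any truncated sequences the formatting layer produces.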

Andrew B
  • As far as I can tell, if the character would be truncated it isn't output at all. You only get the -1 when trying to output something that isn't a valid character. – Zan Lynx Aug 19 '14 at 14:59
  • Interesting. On glibc 2.18 this behavior doesn't exist. printf seems to treat it as a byte string just as if it was in the C language. – Zan Lynx Aug 19 '14 at 15:08
  • @ZanLynx Sorry, the version numbers were specified in the link but I should have mentioned them in my post. It was fixed in 2.17 and backported to 2.16, but apparently Debian has [no intention](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308) to backport the fix to Wheezy's 2.13. – Andrew B Aug 19 '14 at 17:19
  • I am curious if there's any reason for your MUD code to actually process UTF-8, or if it is enough to just pass it around? If so, just force your locale to C and process all text as 8-bit clean buffers of bytes. – Zan Lynx Aug 19 '14 at 18:04
  • @ZanLynx I don't use `wchar_t` arrays internally. I still need to detect illegal byte sequences (trivial to do by hand admittedly) and identify code points, but it might not require re-inventing the wheel if I identify a library that will perform those functions in a C locale. I'll take a look at ICU tonight. – Andrew B Aug 19 '14 at 19:18
  • @ZanLynx: That doesn't work if you have a maximum length for any reason. To truncate, you need to interpret. – Mooing Duck Aug 19 '14 at 21:29
  • @MooingDuck: Most of the max-length formats used in MUD code are for buffer limits. Normal use would never hit them, only people fuzzing it for security bugs. And for those people, who cares if their UTF-8 is corrupted. – Zan Lynx Aug 19 '14 at 21:34
  • @ZanLynx: Hopefully yes. However, sometimes buffers are smaller than you'd think (name buffers are commonly way too small). Additionally, anything that truncates to fit data into a table or otherwise aligned format will need to parse the string to figure out where to truncate, or will accidentally truncate early. – Mooing Duck Aug 19 '14 at 21:38
  • @ZanLynx Mooing Duck is correct, there are many instances of this in the code I'm working with. An example would be the debugger for our softcode tokenizer, which truncates strings well in advance of the buffer limit to keep the output manageable. In any event, I've already established that I need to interpret. – Andrew B Aug 19 '14 at 21:40
  • Ah, softcode. It is for the weak! Back in my day we wrote DikuMUD and all behavior was in C! To modify a mob script we rebuilt the server and reloaded at 2 am or whenever it next crashed! Our functions didn't have parameters. They had arguments! And they always won! – Zan Lynx Aug 19 '14 at 22:16

1 Answer


I'm guessing, and it seems to be confirmed by the comments to the question, that you don't use all that much of the C library's locale-specific functionality. In that case you'd probably be better off not changing the locale to a UTF-8 based one, and instead leaving it in the single-byte locale your code assumes.

When you do need to process UTF-8 strings as UTF-8 strings, you can use specialized code. It's not too hard to write your own UTF-8 processing routines. You can even download the Unicode Character Database and do some fairly sophisticated character classification. If you'd prefer to use a third-party library to handle UTF-8 strings, there's ICU, as you mentioned in your comments. It's a pretty heavyweight library, though; a previous question recommends a few lighter-weight alternatives.
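As an illustration of how little code the basic operations take, here's a sketch (the function name is mine, not from any library) that finds where to cut a UTF-8 string so a length limit never splits a multi-byte character:

```c
#include <stddef.h>
#include <string.h>

/* Return the largest length <= max that doesn't split a multi-byte
   character.  Assumes the input is itself well-formed UTF-8. */
size_t utf8_truncate_len(const char *s, size_t max)
{
    size_t len = strlen(s);
    if (len <= max)
        return len;
    /* Back up over continuation bytes (10xxxxxx) so the cut lands on
       the start of a character. */
    while (max > 0 && ((unsigned char)s[max] & 0xC0) == 0x80)
        max--;
    return max;
}
```

Routines like this are enough to replace the `%.Ns` precision trick for display-width limiting, without pulling in a full Unicode library.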

It might also be possible to switch the locale to "C" and back as necessary so you can still use the C library's functionality. You'll want to check the performance impact of this, however, as switching locales can be an expensive operation.
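For example, something along these lines using the per-thread locale functions from POSIX.1-2008 (the wrapper name is mine; measure the cost before committing to this approach):

```c
#define _GNU_SOURCE     /* exposes newlocale()/uselocale() on glibc */
#include <locale.h>
#include <stdarg.h>
#include <stdio.h>

/* Run vsnprintf with the calling thread temporarily switched to the
   "C" locale, so precision counts bytes and never fails with EILSEQ. */
int snprintf_c(char *buf, size_t size, const char *fmt, ...)
{
    static locale_t c_loc = (locale_t)0;   /* lazily created, then reused */
    locale_t old;
    va_list ap;
    int n;

    if (c_loc == (locale_t)0)
        c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);

    old = uselocale(c_loc);        /* per-thread, unlike setlocale() */
    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    uselocale(old);                /* restore the caller's locale */
    return n;
}
```

uselocale only swaps a thread-local pointer, so it should be much cheaper than a global setlocale call, but it's still worth benchmarking in your hot paths. (The lazy static initialization above is not thread-safe as written; a real version would initialize `c_loc` once at startup.)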

Ross Ridge
  • I'd rather bring in a library than reinvent the wheel, especially since I'd gain access to Unicode character properties that are not exposed by the standard library. My original goal was to avoid adding dependencies, but that's wishful thinking when glibc is broken in major software distros. I agree about ICU being a bit overkill though and am currently evaluating alternatives. – Andrew B Aug 20 '14 at 03:21
  • It's not too hard to convert the UCD tables into C tables you can use to access Unicode character properties from your own code. – Ross Ridge Aug 20 '14 at 03:34
  • When all is said and done, though, there's always going to be more code to write than you realize going in. Does a person *really* need to write their own implementation of how to normalize input strings to NFC form, for example? I ended up going with [GNU libunistring](https://www.gnu.org/software/libunistring/manual/libunistring.html); it provided a good balance of simplicity, understandable documentation, and a set of `sprintf` family mappings for adapting existing column-based formatting code. – Andrew B Aug 27 '14 at 05:22
  • Sure, having no idea what UTF-8 processing you needed to do, I was just pointing out that, as these things go, handling UTF-8 in your own code isn't that hard. GNU libunistring sounds like a good fit for your project, but when you can't find an existing library that fits well, sometimes it's worth rolling your own rather than trying to make something else work. All I knew for sure was that writing your own code would be better than trying to make GNU libc do what you want. – Ross Ridge Aug 27 '14 at 06:10