I'm currently learning C and lately, I have been focusing on the topic of character encoding. Note that I'm a Windows programmer. While I currently test my code only on Windows, I want to eventually port it to Linux and macOS, so I'm trying to learn the best practices right now.

In the example below, I store a file path in a wchar_t array so that it can be opened later with _wfopen. I need _wfopen because the path may contain characters that are not in my default code page. Afterwards, a text literal and the file path are written into a char array named message for further use. My understanding is that the %ls conversion specifier lets you write a wide string into a multibyte string.

char message[8094] = "";
wchar_t file_path[4096] = L"C:\\test\\test.html";
sprintf(message, "Accessing: %ls\n", file_path);

While the code works, GCC/MinGW outputs the following warning and notes:

warning: '%ls' directive writing up to 49146 bytes into a region of size 8083 [-Wformat-overflow=]|
note: assuming directive output of 16382 bytes|
note: 'sprintf' output between 13 and 49159 bytes into a destination of size 8094|

My issue is that I simply do not understand how sprintf could output up to 49159 bytes into the message variable. I output the Accessing: string literal, the file_path variable, the \n char and the \0 char. What else is there to output?

Sure, I could declare message as a wchar_t array and use wsprintf instead of sprintf, but my understanding is that wchar_t does not make for nicely portable code. As such, I'm trying to avoid it unless a specific API requires it.

So, what am I missing?

Pascal Bergeron

1 Answer

The warning doesn't take into account the actual contents of file_path; it is calculated on the assumption that file_path could hold any possible content. There would be an overflow if file_path consisted of 4095 emoji and a null terminator.

In the narrow printf family, %ls converts the wide string to multibyte characters, and each wide character can expand to several bytes.
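To make the expansion concrete, here is a minimal sketch of how the converted size could be estimated before formatting. The helper names worst_case_bytes and converted_bytes are hypothetical, not part of any library. The sketch assumes the conversion follows the current locale, which is how the C Standard describes %ls; as noted further down, MSVCRT-based builds may not behave this way.

```c
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>   /* memset */
#include <wchar.h>    /* wcslen, wcsrtombs, mbstate_t */

/* Hypothetical helper: upper bound on the bytes needed to hold ws as a
   multibyte string in the current locale, including the terminator. */
size_t worst_case_bytes(const wchar_t *ws)
{
    return wcslen(ws) * MB_CUR_MAX + 1;
}

/* Hypothetical helper: exact converted length in bytes, excluding the
   terminator, or (size_t)-1 if some character cannot be converted. */
size_t converted_bytes(const wchar_t *ws)
{
    mbstate_t state;
    const wchar_t *src = ws;

    memset(&state, 0, sizeof state);
    /* With a null destination, wcsrtombs only counts the bytes. */
    return wcsrtombs(NULL, &src, 0, &state);
}
```

For a plain ASCII path the two figures differ only by the terminator in the default "C" locale, but in a locale where MB_CUR_MAX is larger the worst case grows accordingly. That worst case is the kind of bound the compiler has to assume when it cannot see the contents.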

To avoid this warning you could:

  • disable it with -Wno-format-overflow
  • use snprintf instead of sprintf

The latter is always a good idea, IMHO. It gives you a second line of defence against mistakes introduced during later maintenance (e.g. someone changes the code to take the path from user input instead of a hardcoded value).
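A minimal sketch of the snprintf approach, wrapped in a hypothetical helper named format_access_message (the name and the 0/-1 return convention are my own choices for illustration):

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: formats "Accessing: <path>\n" into dst.
   Returns 0 on success, -1 if the wide string could not be converted
   or the result did not fit in dst_size bytes. */
int format_access_message(char *dst, size_t dst_size, const wchar_t *path)
{
    int n = snprintf(dst, dst_size, "Accessing: %ls\n", path);

    if (n < 0)                    /* encoding/conversion error */
        return -1;
    if ((size_t)n >= dst_size)    /* output was truncated */
        return -1;
    return 0;
}
```

Called as format_access_message(message, sizeof message, file_path), it can never write past the end of message, and the return value tells you whether the text was cut short.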


Afterword: be very careful when using wide characters with the printf family in MinGW, which implements the printf family by calling MSVCRT, and MSVCRT does not follow the C Standard.

To get closer to standard behaviour, use a build of MinGW-w64 which attempts to implement stdio library functions itself, instead of deferring to MSVCRT. (E.g. MSYS2 build).

M.M
  • Correct me if I'm wrong, but isn't an emoji at most 4 bytes? (4095 emoji x 4 bytes) + 1 byte for the null terminator would amount to 16,381 bytes. This is close to the number in the first note (16,382 bytes), but it doesn't match at all the number in the first warning (49,146 bytes) or the second note (49,159 bytes). – Pascal Bergeron Sep 20 '22 at 00:24
  • @PascalBergeron well it depends what character sets are in use. `wchar_t` is not necessarily Unicode, and the MBCS is not necessarily UTF-8 (and in fact it historically wasn't, and I'm not sure if that is even the default now, and on what compilers). The numbers involved seem to be allowing for up to 12 bytes of MBCS per `wchar_t` unit. It's also possible that the warnings are coded based on systems that have 4-byte `wchar_t` and not tailored for Windows 2-byte `wchar_t`. – M.M Sep 20 '22 at 03:02
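For what it's worth, the fixed parts of the diagnostics can be checked by hand. The sketch below only redoes the arithmetic on the figures quoted in the question; the function names are hypothetical, and it makes no claim about how GCC arrives at the 49146 figure internally.

```c
/* Figures copied from the question and its compiler diagnostics. */
enum {
    MESSAGE_SIZE = 8094,   /* sizeof message */
    PREFIX_LEN   = 11,     /* strlen("Accessing: ") */
    SUFFIX_LEN   = 2,      /* '\n' plus the terminating '\0' */
    WARNED_MAX   = 49146   /* "'%ls' directive writing up to 49146 bytes" */
};

/* Space left in message once the literal prefix has been written. */
long region_size(void) { return MESSAGE_SIZE - PREFIX_LEN; }

/* Total output if file_path were an empty string. */
long min_output(void) { return PREFIX_LEN + SUFFIX_LEN; }

/* Total output if %ls produced the warned maximum. */
long max_output(void) { return PREFIX_LEN + WARNED_MAX + SUFFIX_LEN; }
```

These give 8083, 13 and 49159, matching the warning and the second note. As a purely arithmetic observation, 49146 = 6 × 8191 and 16382 = 2 × 8191, so both directive figures are multiples of the same count; what that count represents inside the compiler is a guess on my part.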