
I am trying to convert a Windows wchar_t[] to a UTF-8-encoded char[] so that calls to WriteFile will produce UTF-8 encoded files. I have the following code:

#include <windows.h>
#include <fileapi.h>
#include <stringapiset.h>

int main() {
    HANDLE file = CreateFileW(L"test.txt", GENERIC_ALL, 0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    const wchar_t source[] = L"hello";
    char buffer[100];
    WideCharToMultiByte(CP_UTF8, 0, source, sizeof(source)/sizeof(source[0]), buffer, sizeof(buffer)/sizeof(buffer[0]), NULL, NULL);
    WriteFile(file, buffer, sizeof(buffer), NULL, NULL);
    return CloseHandle(file);
}

This produces a file containing "hello" but also a large amount of garbage after it.

Something about this made me think the issue was more than just the excess characters in buffer being dumped, and that the conversion wasn't happening properly, so I changed the source text as follows:

const wchar_t source[] = L"привет";

And this time got the following garbage:

[screenshot: garbled output]

So then I thought maybe it's getting confused because it's looking for a null terminator and not finding one, even though lengths are specified? So I changed the source string again:

const wchar_t source[] = L"hello\n";

And got the following garbage:

[screenshot: garbled output]

I'm fairly new to the WinAPIs, and am not primarily a C developer, so I'm sure I'm missing something; I just don't know what else to try.

edit: Following the advice from RbMm has removed the excess garbage, so English prints correctly. However, the Russian is still garbage, just shorter garbage. Contrary to zett42's comment, I am most definitely using a UTF-8 text editor.

[screenshot: VS Code interpreting the file as UTF-8, Russian still garbled]

UTF-8 doesn't need a BOM, but adding one anyway produces:

[screenshot: an empty file]

Well that's odd. I expected the same text with a slightly larger binary size. Instead there's nothing.

edit:

Since some are keen on sticking to the idea that I'm using WordPad, here's what WordPad looks like

[screenshot: WordPad]

I'm clearly not using WordPad. I'm using VS Code, although the garbage is identical whether opened in VS Code, Visual Studio, Notepad, or Notepad++.

edit:

Here's the hex dump of the output from Russian:

[screenshot: hex dump of the output file]

Patrick Kelly
  • you write unconditionally `sizeof(buffer)` instead of `strlen(buffer)`; of course there will be garbage from the stack – RbMm Jul 21 '19 at 15:23
  • With sizeof(buffer) that WriteFile() call writes too much. Simply use the return value of WideCharToMultiByte() instead. – Hans Passant Jul 21 '19 at 15:23
  • _"And this time got the following garbage"_ -- not garbage (at least not in the actual characters you converted), just a viewer that doesn't understand or is not configured for UTF-8. Writing a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) could help. – zett42 Jul 21 '19 at 15:35
  • @zett42 I am most definitely using a UTF-8 editor actively set to UTF-8 interpretation; I've attached a screenshot in the edit. – Patrick Kelly Jul 21 '19 at 15:55
  • Strange, the output you are seeing is exactly what I get when viewing a UTF-8 encoded file containing "привет" in ASCII character set only. – zett42 Jul 21 '19 at 16:04
  • That's what confuses me. I am using `CP_UTF8` but seemingly getting ASCII instead. I can't figure out from the docs what I am doing wrong. – Patrick Kelly Jul 21 '19 at 16:05
  • What do you see when opening the file in notepad? – zett42 Jul 21 '19 at 16:05
  • With a default open it correctly determines it is UTF-8 and shows "привет". Opening in ANSI instead shows "ÿрøòõт". And opening in UTF-16LE shows "郃뿂釃苢쎬슐쎸슐쎲슐쎵骀" – Patrick Kelly Jul 21 '19 at 16:08
  • Make sure to save your source file in UTF-8 too! See https://stackoverflow.com/a/53259212/7571258 – zett42 Jul 21 '19 at 16:14
  • You are writing garbage. Garbage is rarely valid UTF-8. Having a code editor interpret something that isn't UTF-8 as if it were has unpredictable results. If you want to be sure that your editor doesn't get in your way of analyzing the output, don't have it interpret your data. Open it in a hex editor instead. – IInspectable Jul 21 '19 at 17:34
  • @IInspectable Clearly I am. In no way does that help me figure out why I am, or how I can fix it. – Patrick Kelly Jul 21 '19 at 19:43
  • Nice. Someone thought that pointing out, that the compiler is wrong was too harsh a comment. As it turns out, the compiler *was* wrong. You could have had an answer a day ago... – IInspectable Jul 23 '19 at 06:45

2 Answers


Update 3: The hex output suggests that the source file's encoding was misinterpreted during compilation. Instead of UTF-8, Windows codepage 1252 was assumed, which means the string has the wrong encoding in the compiled program. The byte sequence stored in the output file is therefore C3 90 C2 BF C3 91 E2 82 AC C3 90 C2 B8 C3 90 C2 B2 C3 90 C2 B5 C3 91 E2 80 9A instead of the correct D0 BF D1 80 D0 B8 D0 B2 D0 B5 D1 82.

How to solve this problem depends on the toolchain. MSVC has the /utf-8 flag to set the source and execution character sets. You might think this is redundant since the source file is already saved as UTF-8, but it turns out WordPad isn't the only software that requires a BOM to detect UTF-8. The following excerpt from the documentation explains the reason for the whole encoding problem:

By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you have specified a code page by using /utf-8 or the /source-charset option.

In Visual Studio 2017 you can also configure the character set by setting Character Set in Configuration Properties > General > Project Defaults. If you use CMake you will likely not encounter this problem because it configures everything properly out of the box.

Update 2: Some editors may not be able to deduce that the content is UTF-8 from a short byte sequence like this, which will result in the garbled output you've seen. You could add the UTF-8 byte order mark (BOM) at the beginning of the file to help these editors, although it's not considered a best practice since it conflates metadata and content, breaks ASCII backward compatibility and UTF-8 can be properly detected without it. It's mostly legacy software like Microsoft's WordPad that needs the BOM to interpret the file as UTF-8.

if (WriteFile(file, "\xef\xbb\xbf", 3, NULL, NULL) == 0) { goto error; }

Update: Code with a bit of basic error handling:

#include <windows.h>
#include <fileapi.h>
#include <stringapiset.h>
#include <stdlib.h> /* calloc, free */

int main() {
    int ret_val = -1;

    const wchar_t source[] = L"привет";

    HANDLE file = CreateFileW(L"test.txt", GENERIC_ALL, 0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

    if (file == INVALID_HANDLE_VALUE) { goto error_0; }

    size_t required_size = WideCharToMultiByte(CP_UTF8, 0, source, -1, NULL, 0, NULL, NULL);

    if (required_size == 0) { goto error_0; }

    char *buffer = calloc(required_size, sizeof(char));

    if (buffer == NULL) { goto error_0; }

    if (WideCharToMultiByte(CP_UTF8, 0, source, -1, buffer, required_size, NULL, NULL) == 0) { goto error_1; }

    if (WriteFile(file, buffer, required_size - 1, NULL, NULL) == 0) { goto error_1; }

    if (CloseHandle(file) == 0) { goto error_1; }

    ret_val = 0;

error_1:
    free(buffer);

error_0:
    return ret_val;
}

Old: You can do the following, which will create the file just fine. The first call to WideCharToMultiByte is used to determine the number of bytes required to store the UTF-8 string. Make sure to save the source file as UTF-8, otherwise the source string will not be properly encoded in the source file.

The following code is just a quick and dirty example and lacks rigorous error handling.

#include <windows.h>
#include <fileapi.h>
#include <stringapiset.h>
#include <stdlib.h> /* calloc, free */

int main() {
    HANDLE file = CreateFileW(L"test.txt", GENERIC_ALL, 0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    const wchar_t source[] = L"привет";

    size_t required_size = WideCharToMultiByte(CP_UTF8, 0, source, -1, NULL, 0, NULL, NULL);

    char *buffer = (char *) calloc(required_size, sizeof(char));

    WideCharToMultiByte(CP_UTF8, 0, source, -1, buffer, required_size, NULL, NULL);
    WriteFile(file, buffer, required_size - 1, NULL, NULL);
    free(buffer);
    return CloseHandle(file);
}
  • Missing a "free". – zett42 Jul 21 '19 at 16:24
  • @zett42 true, also all kinds of error handling. This was just a quick hack to show that it works. –  Jul 21 '19 at 16:27
  • Copied exactly and still got "привет" – Patrick Kelly Jul 21 '19 at 16:41
  • @PatrickKelly I tested this and it seems like your editor isn't interpreting the content as UTF-8. When I open the file in WordPad I get the same garbled output; Editor and Notepad++ work fine. –  Jul 21 '19 at 18:07
  • @PatrickKelly Can you check if the BOM helps you to display the file correctly? –  Jul 21 '19 at 18:23
  • I don't know why you're so keen that I'm using WordPad to edit my files when the attached screenshot looks nothing like WordPad. I'm using VS Code, and in the attached screenshot I clearly showed it interpreting as UTF-8 but still displaying garbage. – Patrick Kelly Jul 21 '19 at 19:50
  • Copied your old example into a brand new file created in Notepad++, compiled it, and opened the output in Nodepad++, result was exactly the same. – Patrick Kelly Jul 21 '19 at 19:51
  • Copied your new example into Notepad++ as well, and repeated the process. Output is the same, but with a slightly larger binary size. – Patrick Kelly Jul 21 '19 at 19:52
  • @PatrickKelly It's a lot easier to see the problems if the file output is shown as hex, because it removes all ambiguity as to whether the content is incorrect or if the editor messes up with the interpretation. –  Jul 21 '19 at 20:09
  • Added the setlocale() and got the same output again. I've attached the hex dump. – Patrick Kelly Jul 21 '19 at 20:13
  • You didn't ask what I was using either. And as I have said several times now, I included in the screenshot that the editor did recognize it as UTF-8. Incorrectly recognizing the encoding wasn't the issue; that's one of the first things I checked before posting here. – Patrick Kelly Jul 21 '19 at 20:15
  • @PatrickKelly Thank you. Interesting. The hex output is also way too long. I'll have to think about this some more. Sorry for not having an answer right away. –  Jul 21 '19 at 20:23
  • That actually fixed it. So it looks like it needs to be explicitly told with /utf-8, because even the BOM didn't do the trick. So it was the compiler and never the editor that was the issue. Why then was it problematic here when the other answer worked fine without compilation changes? That makes very little sense to me. – Patrick Kelly Jul 22 '19 at 12:32
  • When I did my tests I just used his conversion part, and still wrote to the same file. File was opened the same way. Even written the same way. Just used the conversion. – Patrick Kelly Jul 22 '19 at 12:52
  • @Patrick Ah right. I think I know what's going on. In both our codes the string would be encoded wrong; however, in his case the setlocale would likely result in using 1252 AND wcstombs interprets the wchar_t* using said locale. WideCharToMultiByte on the other hand expects a specific input encoding, UTF-16. See https://stackoverflow.com/questions/10752591/how-to-use-wcstombs-instead-of-widechartomultibyte which is also the reason for the note: https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte –  Jul 22 '19 at 13:00
  • @PatrickKelly This is why encoding is far from easy. I hope you're satisfied with that answer. –  Jul 22 '19 at 13:10

Typically there are two completely separate parts to this: getting the conversion to UTF-8 right, and getting your display environment to properly display the resulting UTF-8 encoding.

Here's the straight C answer. (I can't help you with the Windows-specific stuff.)

I rewrote your program like this:

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>

int main()
{
    const wchar_t source[] = L"привет";
    char utf8[30];
    int n;
    setlocale(LC_ALL, "");
    n = wcstombs(utf8, source, sizeof(utf8));
    printf("%.*s\n", n, utf8);
}

wcstombs is the Standard C function for converting a wide-character string to a "multibyte" string such as UTF-8; I presume WideCharToMultiByte is the Windows-specific equivalent.

Since wcstombs can perform different conversions depending on the current locale, it's important to set the "locale" correctly. In my environment (which is not Windows), my locale is set to "en_US.UTF-8". That line

setlocale(LC_ALL, "");

says that in this C program, I'm electing to use the locale as set in my environment (instead of using the default "C" locale).

And then when I run this program, in my environment which is set up to display UTF-8-encoded program output correctly, I see the output "привет" displayed, as expected.

I was afraid it might be harder for you (whether you use wcstombs or WideCharToMultiByte), because under some versions of Windows I gather it required a certain amount of effort to get UTF-8 to display properly. But from what you've added in a comment it sounds like that part's working fine.

Steve Summit
  • This works exactly as it should. Seems like the problem had to do with how I was calling `WideCharToMultiByte`, but since this works I'll just stick with `wcstombs`. – Patrick Kelly Jul 21 '19 at 16:17
  • I'd like to add that, at least on modern Windows, UTF-8 works fine. I think a lot of the criticisms of UTF-8 support go back to the days of XP and people haven't revisited it at all. – Patrick Kelly Jul 21 '19 at 16:18
  • @PatrickKelly Oh! Glad to hear it. My "rewrite" also failed to be a drop-in replacement in that it printed to the screen while yours wrote to a file, but I guess you can deal with that. I'm glad to hear about the Windows support situation, too; I'll soften that part of my answer. – Steve Summit Jul 21 '19 at 16:18
  • Yes, console UTF-8 just requires an additional function call before the write, just to let the system know it's UTF-8. I think people believe it to be broken because that call is not at all required on most UNIX systems. – Patrick Kelly Jul 21 '19 at 16:20