Converting Encodings

Question

I'm using the Win32 API MultiByteToWideChar() function to convert any encoding into Wide characters. The issue is, I will be streaming data in. For example, I could read a chunk of fixed width data into a buffer, and then call that function.

The problem arises when a chunk ends in the middle of a multi-byte character: MultiByteToWideChar() then fails.

My question is, how do I get the index of the last full character in the buffer?

I suppose I could try again with a shortened buffer every single time the function fails, but with large buffers this is extremely inefficient.

I wanted to do this because I tried out both ICONV and ICU. ICONV was slower than the .NET decoder class, so I implemented that in C++. Then, I found out that ICU was faster than the .NET decoder. Then, I figured out MultiByteToWideChar() is the fastest.

For UTF8, by design, you can identify the first byte in an encoded code point. So it's easy to start at the end of the buffer and find the right place to chop. Other multi-byte encodings are not so amenable. — David Heffernan, Oct 06 '20 at 05:42
Yeah. I wanted to do this because I tried out both Iconv AND ICU. Iconv was slower than the .NET decoder class, so I implemented that in C++. Then, I found out ICU was faster than .NET decoder. Then, I figured out MultibyteToWideChar is the fastest. — TesterMan123, Oct 06 '20 at 05:44
Are you sure that decoding is the bottleneck in your program? — David Heffernan, Oct 06 '20 at 05:47
Yeah, I'm sure. I'm actually pretty happy with ICU performance, just it takes around 6.5MB of app size for the most stripped down version :( — TesterMan123, Oct 06 '20 at 05:49
In the general case, I don't think it's possible. This SO question may give you some answers: https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream and in fact may convince you to stick with the brute-force-try-again approach in the general case and optimize/specify for some cases like UTF8, etc. — Simon Mourier, Oct 06 '20 at 06:43
What do you mean by "full character"? The last code point? Or really a full character (which may include multiple code points)? Go near the end and check the next code point, then check if it is combining, skip to the next code point ... The algorithm is described in Unicode — Giacomo Catenazzi, Oct 06 '20 at 06:59
@GiacomoCatenazzi I will try your solution. Because MultiByteToWideChar will still convert the characters, but the last code point might be invalid. So I can start search from the end. — , Oct 06 '20 at 15:14
@TesterMan123 FYI, ICU is now [built into Windows 10](https://learn.microsoft.com/en-us/windows/win32/intl/international-components-for-unicode--icu-), so you don't need to include it in your app anymore. You might consider compiling ICU into a DLL, and then you don't need to deploy that DLL on Win10+. — Remy Lebeau, Oct 06 '20 at 17:47
@TesterMan123 Unlike ICONV and ICU, `MultiByteToWideChar()` doesn't stop converting when the buffer is in the middle of a codepoint. The best you could do about that is omit the `MB_ERR_INVALID_CHARS` flag and detect when the output produces `U+FFFD` characters on the end, and if so then back up a little and try again. Otherwise, use `MultiByteToWideChar()` to convert the buffer 1 codepoint at a time, and stop when it fails. — Remy Lebeau, Oct 06 '20 at 17:50
@RemyLebeau I realize that ICU is built into Windows 10; however, my app will be used on older versions of Windows as well (at least Windows 7) — , Oct 06 '20 at 23:28
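
For the UTF-8 case mentioned in the comments, the "start at the end of the buffer and find the right place to chop" idea can be sketched roughly as follows. This is only an illustration, not a tested drop-in: the function name `utf8CompletePrefixLength` is made up for this example, it assumes the stream is otherwise well-formed UTF-8, and it does nothing for other multi-byte code pages.

```cpp
#include <cstddef>

// Hypothetical helper (name is an assumption, not part of any API).
// Returns how many leading bytes of buf[0..len) end on a complete UTF-8
// sequence. Any bytes after that point are the start of a sequence that was
// cut off by the chunk boundary and should be carried into the next chunk.
// It only inspects the tail of the buffer; it does not validate the data.
std::size_t utf8CompletePrefixLength(const unsigned char* buf, std::size_t len) {
    // Step back over at most 3 trailing continuation bytes (10xxxxxx).
    std::size_t i = len;
    std::size_t back = 0;
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        --i;
        ++back;
    }
    if (i == 0) {
        return len;  // no lead byte in range; leave it to the decoder
    }

    // buf[i - 1] should now be the lead byte of the last sequence.
    const unsigned char lead = buf[i - 1];
    std::size_t need = 0;  // total bytes the last sequence requires
    if (lead < 0x80) {
        need = 1;  // ASCII
    } else if ((lead & 0xE0) == 0xC0) {
        need = 2;
    } else if ((lead & 0xF0) == 0xE0) {
        need = 3;
    } else if ((lead & 0xF8) == 0xF0) {
        need = 4;
    } else {
        return len;  // malformed tail, not a truncation; leave it to the decoder
    }

    const std::size_t have = len - (i - 1);  // bytes of that sequence present
    return have < need ? i - 1 : len;
}
```

A caller would convert only the first `n = utf8CompletePrefixLength(buf, len)` bytes (for example by passing `n` as the byte count to `MultiByteToWideChar()` with `CP_UTF8`), then move the remaining `len - n` bytes to the front of the buffer before reading the next chunk. At most three bytes are ever carried over, so the cost is constant per chunk rather than a retry loop over the whole buffer.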
