0

I have an inquiry about conversion between UTF-8 and UTF-16: does it require decoding the UTF-8/UTF-16 back to its original codepoint first and then converting to the target encoding, or is it possible to convert from one encoding to the other directly, e.g. UTF-16 to UTF-8 or vice versa?

For example, I have the character س, whose UTF-8 encoding is 0xD8 0xB3. Does the conversion require decoding the UTF-8 to its codepoint U+0633 first, and then encoding that again as UTF-16 0x0633?

Lion King
  • 32,851
  • 25
  • 81
  • 143
  • Since the encoding from the "codepoint", `U+0633`, to UTF-16, `0x0633` involves, er..., apparently doing absolutely nothing whatsoever, the question seems to be a moot point, isn't it? – Sam Varshavchik Jun 15 '20 at 02:31
  • I don't have the answer. There probably is a direct way, but I think there is a fair chance you would end up doing at least as much work. – Galik Jun 15 '20 at 02:32

4 Answers

0

If your UTF-8 code unit is less than 128 (a single-byte sequence), then you can immediately generate the UTF-16 equivalent. In a very real sense, though, you have decoded the entire UTF-8 character to its codepoint and re-encoded it in UTF-16. So we'd just be debating semantics as to whether that counts as going directly to the other encoding or not.

UTF-8 encodings up to three bytes have to be completely decoded, and the UTF-16 encoding is just that decoded value as two bytes. So have you actually re-encoded it into UTF-16 or did you convert to UTF-16 directly? It's really just a point of view.

The most complicated version is when the UTF-8 encoding is four bytes, since those represent codepoints beyond the BMP, so the UTF-16 encoding would be a surrogate pair. I don't think there's any computational shortcut to be taken there. If there were, it probably wouldn't be worth it. Such shortcuts could actually run slower on modern processors since you'd need extra conditional branch instructions, which could thwart branch prediction and pipelining.

I think you can make roughly the same argument in the reverse direction as well.

So I'm going to say, yes, you do have to convert to the actual codepoint when transcoding between UTF-8 and UTF-16.
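
For concreteness, here is a rough sketch of that decode-then-encode pipeline (the function names are made up for illustration, and validation of malformed or truncated input is omitted):

```cpp
#include <string>

// Decode one UTF-8 sequence starting at `in` into its codepoint.
// (No validation of overlong or truncated sequences.)
char32_t decode_utf8(const unsigned char* in)
{
    if (in[0] < 0x80)                                   // 1 byte: U+0000..U+007F
        return in[0];
    if ((in[0] & 0xE0) == 0xC0)                         // 2 bytes: U+0080..U+07FF
        return (char32_t(in[0] & 0x1F) << 6) | (in[1] & 0x3F);
    if ((in[0] & 0xF0) == 0xE0)                         // 3 bytes: U+0800..U+FFFF
        return (char32_t(in[0] & 0x0F) << 12) | (char32_t(in[1] & 0x3F) << 6) | (in[2] & 0x3F);
    return (char32_t(in[0] & 0x07) << 18) | (char32_t(in[1] & 0x3F) << 12)   // 4 bytes
         | (char32_t(in[2] & 0x3F) << 6)  | (in[3] & 0x3F);
}

// Encode one codepoint as UTF-16, appending one or two code units to `out`.
void encode_utf16(char32_t cp, std::u16string& out)
{
    if (cp <= 0xFFFF) {                                 // BMP: a single 16-bit code unit
        out.push_back(char16_t(cp));
    } else {                                            // beyond the BMP: a surrogate pair
        cp -= 0x10000;
        out.push_back(char16_t(0xD800 | (cp >> 10)));
        out.push_back(char16_t(0xDC00 | (cp & 0x3FF)));
    }
}
```

For the question's example, decode_utf8 on the bytes 0xD8 0xB3 yields U+0633, and encode_utf16 then emits the single code unit 0x0633.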

Adrian McCarthy
  • 45,555
  • 16
  • 123
  • 175
  • "*I don't think there's any computational shortcut to be taken there.*" Sure there is, and you've already taken it. Figuring out which UTF-8 encoded characters would require surrogate pairs in UTF-16 is a big step, because it means you don't have to ask that question when doing the UTF-16 encoding part. The UTF-8 itself already told you the answer. So the overall process can be made more efficient by only doing surrogate pairs if it's a 4-byte UTF-8 encoding. – Nicol Bolas Jun 15 '20 at 15:32
  • @NicolBolas: I didn't word that particular bit very well. The gist of my answer isn't about efficiency; it's that you cannot get from one encoding to the other without computing the codepoint, which I believe was the intent of the question. – Adrian McCarthy Jun 16 '20 at 15:26
  • Are there branches? Beyond distinguishing the length of the UTF-8 sequence, the rest is just shifts and masks (plus one initial test for a BOM), but that is a property of UTF-8 itself. And no, you don't have to decode the UTF-8 completely: you can work one byte at a time on input, and the output can use the current and next byte. [Useful e.g. on an 8-bit processor, such as a microcontroller.] – Giacomo Catenazzi Jun 17 '20 at 12:43
0

UTF-8's decoding algorithm works like this. You do up to 3 conditional tests against the first byte to figure out how many bytes to process, and then you process that number of bytes into a codepoint.

UTF-16's encoding algorithm works by taking the codepoint and checking whether it is larger than 0xFFFF. If so, you encode it as a surrogate pair of two 16-bit code units; otherwise, you encode it as a single 16-bit code unit.

Here's the thing though. Every codepoint larger than 0xFFFF is encoded in UTF-8 by 4 code units, and every codepoint 0xFFFF or smaller is encoded by 3 code units or less. Therefore, if you did UTF-8 decoding to produce the codepoint... you don't have to do the conditional test in the UTF-16 encoding algorithm. Based on how you decoded the UTF-8 sequence, you already know whether the codepoint needs one 16-bit code unit or two.

Therefore, in theory, a hand-coded UTF-8 → UTF-16 algorithm could involve one less conditional test than going through an explicit codepoint intermediate. But really, that's the only difference. Even for 4-byte UTF-8 sequences, you have to extract the UTF-8 value into a full 32-bit codepoint before you can do the surrogate pair encoding. So the only real efficiency gain possible is the lack of that condition.
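
As a sketch of what that fused version might look like (a hypothetical helper, assuming valid and complete UTF-8 input), note that the lead-byte test which determines the sequence length also decides whether a surrogate pair is emitted, so no separate "larger than 0xFFFF" test appears:

```cpp
#include <string>

// Fused UTF-8 -> UTF-16 sketch: the sequence-length dispatch doubles as the
// surrogate-pair decision. Assumes valid, complete UTF-8 input.
void utf8_seq_to_utf16(const unsigned char* in, std::u16string& out)
{
    if (in[0] < 0x80) {                                 // 1 byte
        out.push_back(char16_t(in[0]));
    } else if ((in[0] & 0xE0) == 0xC0) {                // 2 bytes
        out.push_back(char16_t(((in[0] & 0x1F) << 6) | (in[1] & 0x3F)));
    } else if ((in[0] & 0xF0) == 0xE0) {                // 3 bytes
        out.push_back(char16_t(((in[0] & 0x0F) << 12) | ((in[1] & 0x3F) << 6) | (in[2] & 0x3F)));
    } else {                                            // 4 bytes: always a surrogate pair
        char32_t cp = (char32_t(in[0] & 0x07) << 18) | (char32_t(in[1] & 0x3F) << 12)
                    | (char32_t(in[2] & 0x3F) << 6)  | (in[3] & 0x3F);
        cp -= 0x10000;
        out.push_back(char16_t(0xD800 | (cp >> 10)));
        out.push_back(char16_t(0xDC00 | (cp & 0x3FF)));
    }
}
```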

For UTF-16->UTF-8, you know that any surrogate pair encoding requires 4 bytes in UTF-8, and any non-surrogate pair encoding requires 3 or less. And you have to do that test before decoding UTF-16 anyway. But you still basically have to do all of the work to convert the UTF-16 to a codepoint before the UTF-8 encoder can do its job (even if that work is nothing, as is the case for non-surrogate pairs). So again, the only efficiency gain is from losing one conditional test.
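
A corresponding sketch for that direction (again a made-up helper, assuming well-formed UTF-16 where every high surrogate is followed by a low surrogate):

```cpp
#include <string>

// UTF-16 -> UTF-8 sketch: the surrogate check that UTF-16 decoding needs
// anyway also fixes the UTF-8 length (4 bytes for a surrogate pair,
// 1-3 bytes otherwise). Assumes well-formed UTF-16 input.
void utf16_seq_to_utf8(const char16_t* in, std::string& out)
{
    if (in[0] >= 0xD800 && in[0] <= 0xDBFF) {           // high surrogate: 4-byte UTF-8
        char32_t cp = 0x10000 + ((char32_t(in[0] - 0xD800) << 10) | (in[1] - 0xDC00));
        out.push_back(char(0xF0 |  (cp >> 18)));
        out.push_back(char(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(char(0x80 | ((cp >>  6) & 0x3F)));
        out.push_back(char(0x80 |  (cp        & 0x3F)));
    } else if (in[0] >= 0x800) {                        // 3-byte UTF-8
        out.push_back(char(0xE0 |  (in[0] >> 12)));
        out.push_back(char(0x80 | ((in[0] >> 6) & 0x3F)));
        out.push_back(char(0x80 |  (in[0]       & 0x3F)));
    } else if (in[0] >= 0x80) {                         // 2-byte UTF-8
        out.push_back(char(0xC0 |  (in[0] >> 6)));
        out.push_back(char(0x80 |  (in[0] & 0x3F)));
    } else {                                            // 1-byte UTF-8 (ASCII)
        out.push_back(char(in[0]));
    }
}
```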

These sound like micro-optimizations. If you do a lot of such conversions, and they're performance-critical, it might be worthwhile to hand-code a converter. Maybe.

Nicol Bolas
  • 449,505
  • 63
  • 781
  • 982
-1

Try the top answer to this question:

How to convert UTF-8 std::string to UTF-16 std::wstring?

Ignore the "C++11" answer as the STL calls made are deprecated.

  • Thank you, I saw that question before. The best answer's code converts UTF-8 -> codepoint, then from codepoint to UTF-16, but I don't know whether that is the official/best way to do the conversion. – Lion King Jun 15 '20 at 03:27
-1

The easier way is to decode to codepoints and then encode with the desired encoding. This way you can handle surrogates and special escapes (which are not really UTF-8, but are sometimes used, e.g. to include codepoint U+0000 in an ASCIIZ/C string).

If you write down the UTF-8 <-> codepoint mapping in bit form (and the same for UTF-16; Wikipedia helps), you see that the bits keep their values, so you can just move bits in a direct conversion, without going through the codepoint (and thus without an intermediate variable). It is just shifts and masks (plus an addition/subtraction for UTF-16). I would not do it unless it is a very performance-critical task.
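
A minimal sketch of that bit-level view for BMP characters (a made-up helper covering only 1- to 3-byte UTF-8 sequences; 4-byte sequences would additionally need the surrogate addition/subtraction, and validation is omitted):

```cpp
// "Just move the bits": for BMP codepoints the UTF-16 code unit is assembled
// directly from the UTF-8 payload bits with shifts and masks, with no named
// codepoint variable in between.
char16_t utf8_bmp_to_utf16(const unsigned char* in)
{
    if (in[0] < 0x80)                                   // ASCII passes through unchanged
        return char16_t(in[0]);
    if ((in[0] & 0xE0) == 0xC0)                         // 2 bytes -> one 16-bit code unit
        return char16_t(((in[0] & 0x1F) << 6) | (in[1] & 0x3F));
    return char16_t(((in[0] & 0x0F) << 12)              // 3 bytes -> one 16-bit code unit
                  | ((in[1] & 0x3F) << 6)
                  |  (in[2] & 0x3F));
}
```

For the س from the question, the two bytes 0xD8 0xB3 map straight to the code unit 0x0633.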

Giacomo Catenazzi
  • 8,519
  • 2
  • 24
  • 32