
I have 3 bytes representing a Unicode character encoded in UTF-8. For example, I have E2 82 AC (UTF-8), which represents the Unicode character € (U+20AC). Is there any algorithm to make this conversion? I know there is the Windows API MultiByteToWideChar, but I would like to know if there is a simple mathematical relation between E2 82 AC and U+20AC. So is the mapping from UTF-8 to UTF-16 a simple mathematical function, or is it a hardcoded map?

zeus
  • UTF-8 and UTF-16 are encodings of Unicode code points. So `UTF16 = EncodeUTF16(DecodeUTF8())`. Encoding takes a 32-bit code point and converts it to the appropriate sequence of bytes, while decoding takes a sequence of bytes and converts it to a 32-bit code point. Is it simple? Yes. Is there a direct mapping? No, because both encodings are variable length (and that map would be huge). – Martin York Sep 17 '22 at 21:54
  • Wikipedia has a pretty good description of both [UTF-8](https://en.wikipedia.org/wiki/UTF-8) and [UTF-16](https://en.wikipedia.org/wiki/UTF-16). To transform one to the other, you must first decode the Unicode code point. – Dúthomhas Sep 17 '22 at 21:55
  • Decode the `char8_t` UTF-8 code unit sequence into a `char32_t` Unicode code point, then encode the `char32_t` Unicode code point into `char16_t` UTF-16 code unit sequence. – Eljay Sep 17 '22 at 22:00
  • Does this help? https://stackoverflow.com/a/148766/5987 – Mark Ransom Sep 17 '22 at 22:37

1 Answer


Converting a valid UTF-8 byte sequence directly to UTF-16 is doable with a little mathematical know-how.

Validating a UTF-8 byte sequence is trivial: simply check that the first byte matches one of the patterns below, and that (byte and $C0) = $80 is true for each subsequent byte in the sequence.

The first byte in a UTF-8 sequence tells you how many bytes are in the sequence:

(byte1 and $80) = $00: 1 byte
(byte1 and $E0) = $C0: 2 bytes
(byte1 and $F0) = $E0: 3 bytes
(byte1 and $F8) = $F0: 4 bytes
anything else: error
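
As a minimal C++ sketch of the checks above (not part of the original answer; the function names are my own), the first-byte classification and the continuation-byte test could look like this:

#include <cstdint>

// Classifies the leading byte of a UTF-8 sequence.
// Returns the sequence length (1..4), or 0 for an invalid leading byte.
int Utf8SequenceLength(std::uint8_t byte1)
{
    if ((byte1 & 0x80) == 0x00) return 1;
    if ((byte1 & 0xE0) == 0xC0) return 2;
    if ((byte1 & 0xF0) == 0xE0) return 3;
    if ((byte1 & 0xF8) == 0xF0) return 4;
    return 0; // error
}

// True if `b` is a valid continuation byte, i.e. (b and $C0) = $80.
bool IsUtf8Continuation(std::uint8_t b)
{
    return (b & 0xC0) == 0x80;
}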

There are very simple formulas for converting UTF-8 1-byte, 2-byte, and 3-byte sequences to UTF-16. They all represent Unicode codepoints below U+10000, and thus can be represented as-is in UTF-16 using just one 16-bit codeunit. No surrogates are needed, just some bit twiddling, e.g.:

1 byte:

UTF16 = UInt16(byte1 and $7F)

2 bytes:

UTF16 = (UInt16(byte1 and $1F) shl 6)
        or UInt16(byte2 and $3F)

3 bytes:

UTF16 = (UInt16(byte1 and $0F) shl 12)
        or (UInt16(byte2 and $3F) shl 6)
        or UInt16(byte3 and $3F)
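
The same formulas, expressed as a minimal C++ sketch (mine, not from the answer; the function name and signature are assumptions), decoding a 1-, 2-, or 3-byte sequence into one UTF-16 code unit:

#include <cstdint>

// Decodes a valid 1-, 2-, or 3-byte UTF-8 sequence (already length-checked)
// into a single UTF-16 code unit, using the formulas above.
std::uint16_t Utf8ShortSequenceToUtf16(const std::uint8_t *bytes, int len)
{
    switch (len)
    {
    case 1:
        return bytes[0] & 0x7F;
    case 2:
        return (std::uint16_t(bytes[0] & 0x1F) << 6)
             |  std::uint16_t(bytes[1] & 0x3F);
    default: // 3
        return (std::uint16_t(bytes[0] & 0x0F) << 12)
             | (std::uint16_t(bytes[1] & 0x3F) << 6)
             |  std::uint16_t(bytes[2] & 0x3F);
    }
}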

Converting a UTF-8 4-byte sequence to UTF-16, on the other hand, is slightly more involved. Such a sequence represents a Unicode code point of U+10000 or higher, which must be encoded in UTF-16 as a surrogate pair, and that requires some additional math to calculate, e.g.:

4 bytes:

CP = (UInt32(byte1 and $07) shl 18)
     or (UInt32(byte2 and $3F) shl 12)
     or (UInt32(byte3 and $3F) shl 6)
     or UInt32(byte4 and $3F)
CP = CP - $10000
highSurrogate = $D800 + UInt16((CP shr 10) and $3FF)
lowSurrogate = $DC00 + UInt16(CP and $3FF)
UTF16 = highSurrogate, lowSurrogate
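
And a matching C++ sketch for the 4-byte case (again my own naming, not from the answer), producing the surrogate pair:

#include <cstdint>
#include <utility>

// Decodes a valid 4-byte UTF-8 sequence into a UTF-16 surrogate pair,
// returned as {highSurrogate, lowSurrogate}.
std::pair<std::uint16_t, std::uint16_t> Utf8FourByteToUtf16(const std::uint8_t *bytes)
{
    std::uint32_t cp = (std::uint32_t(bytes[0] & 0x07) << 18)
                     | (std::uint32_t(bytes[1] & 0x3F) << 12)
                     | (std::uint32_t(bytes[2] & 0x3F) << 6)
                     |  std::uint32_t(bytes[3] & 0x3F);
    cp -= 0x10000;
    std::uint16_t highSurrogate = std::uint16_t(0xD800 + ((cp >> 10) & 0x3FF));
    std::uint16_t lowSurrogate  = std::uint16_t(0xDC00 + (cp & 0x3FF));
    return { highSurrogate, lowSurrogate };
}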

Now, with that said, let's look at your example: E2 82 AC

The first byte satisfies ($E2 and $F0) = $E0, the second byte satisfies ($82 and $C0) = $80, and the third byte satisfies ($AC and $C0) = $80, so this is indeed a valid UTF-8 3-byte sequence.

Plugging those byte values into the 3-byte formula, you get:

UTF16 = (UInt16($E2 and $0F) shl 12)
        or (UInt16($82 and $3F) shl 6)
        or UInt16($AC and $3F)

      = (UInt16($02) shl 12)
        or (UInt16($02) shl 6)
        or UInt16($2C)

      = $2000
        or $80
        or $2C

      = $20AC

And indeed, Unicode codepoint U+20AC is encoded in UTF-16 as $20AC.
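
As a quick sanity check, this small self-contained C++ snippet (mine, not from the answer) plugs E2 82 AC into the 3-byte formula and prints U+20AC:

#include <cstdint>
#include <cstdio>

int main()
{
    // The UTF-8 bytes from the question: E2 82 AC
    std::uint8_t b1 = 0xE2, b2 = 0x82, b3 = 0xAC;

    // Apply the 3-byte formula from the answer.
    unsigned utf16 = ((b1 & 0x0F) << 12)
                   | ((b2 & 0x3F) << 6)
                   |  (b3 & 0x3F);

    std::printf("U+%04X\n", utf16); // prints U+20AC
}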

Remy Lebeau