1

When converting emoji encoded in UTF-8 to a string, we did not get the correct characters using UTF8ToString. We receive these UTF-8 bytes from an external interface. We tested the bytes with an online UTF-8 decoder and saw that they contain the correct characters. I suspect these are composite characters.

procedure TestUTF8Conversion;
const
  utf8Denormalized: RawByteString = #$ED#$A0#$BD#$ED#$B8#$85#$20 + #$ED#$A0#$BD#$ED#$B8#$86#$20 + #$ED#$A0#$BD#$ED#$B8#$8A;
  utf8Normalized: RawByteString = #$F0#$9F#$98#$85 + #$F0#$9F#$98#$86 + #$F0#$9F#$98#$8A;
begin
  Memo1.Lines.Add(UTF8ToString(utf8Denormalized));
  Memo1.Lines.Add(UTF8ToString(utf8Normalized));
end;

Output in Memo1:

Denormalized: ���� ���� ����

Normalized: 😅 😆 😊

Writing our own conversion function based on the WinAPI function MultiByteToWideChar did not solve the issue either.

function UTF8DenormalizedToString(s: PAnsiChar): string;
var
  pwc: PWideChar;
  len, srcLen: Cardinal;
begin
  srcLen := StrLen(s); // SysUtils.StrLen; System.Length does not accept a PAnsiChar
  GetMem(pwc, (srcLen + 1) * SizeOf(WideChar));
  // Note: with CP_UTF8, MultiByteToWideChar only accepts dwFlags of 0 or
  // MB_ERR_INVALID_CHARS; passing MB_PRECOMPOSED makes the call fail, and
  // no flag can repair CESU-8 surrogate sequences anyway.
  len := MultiByteToWideChar(CP_UTF8, MB_PRECOMPOSED, s, -1, pwc, srcLen + 1);
  if len > 0 then
    Dec(len); // drop the null terminator included when cbMultiByte = -1
  SetString(Result, pwc, len);
  FreeMem(pwc);
end;
  • 2
    `#$ED#$A0#$BD` is within the range "Non Private Use High Surrogate" and `#$ED#$B8#$85` is within the range "Low Surrogate"; encoded in UTF-8, neither will ever make sense on its own. The remaining `#$20` is just a space. See https://stackoverflow.com/a/51051607/4299358 – AmigoJack Aug 25 '20 at 09:34
  • What I do not understand: both UTF-8 sequences `#$ED#$A0#$BD` and `#$ED#$B8#$85` show this glyph: �. (I tried the following UTF-8 decoder: https://mothereff.in/utf-8) while the concatenated sequence `\xED\xA0\xBD\xED\xB8\x85` shows the expected emoji glyph: 😅 – Schneider Infosystems Ltd Aug 25 '20 at 13:15
  • Back to my question: How can I convert this UTF-8 sequence `#$F0#$9F#$98#$85` so that I get the emoji `U+1F605`? – Schneider Infosystems Ltd Aug 25 '20 at 13:24
  • @SchneiderInfosystemsLtd see the answer I just posted. – Remy Lebeau Aug 25 '20 at 16:25

3 Answers

2

#$ED#$A0#$BD is the UTF-8 encoded form of Unicode codepoint U+D83D, which is a high surrogate.

#$ED#$B8#$85 is the UTF-8 encoded form of Unicode codepoint U+DE05, which is a low surrogate.

#$F0#$9F#$98#$85 is the UTF-8 encoded form of Unicode codepoint U+1F605.

Unicode codepoints in the surrogate range are reserved for UTF-16 and illegal to use on their own, which is why you see � when printed.

Those surrogates happen to be the proper UTF-16 surrogate pair for Unicode codepoint U+1F605 (😅).

So, what you have is a double-encoding issue that needs to be fixed at the source where the UTF-8 data is being generated. U+1F605 is first being encoded to UTF-16, not UTF-8, and then its surrogates are being mistreated as Unicode codepoints and individually encoded to UTF-8. What you want instead is for codepoint U+1F605 to be encoded as-is directly to UTF-8.

If you can't fix the source of the UTF-8 data, then you will just have to manually detect this malformed encoding and handle the data as UTF-16 instead. Decode the UTF-8 data to UTF-32, and if the result contains any surrogate codepoints then create a separate UTF-16 string of the same length and copy the codepoints as-is into that string, truncating their values to 16-bit. Then you can use that UTF-16 string as needed. Otherwise, if no surrogates are present then you can decode the UTF-8 directly to a UTF-16 string normally and use that result instead.

UPDATE: as mentioned in @AmigoJack's answer, this data is using CESU-8 encoding (is that documented in the source interface?). So, knowing this now, you can forgo the manual detection and assume that all UTF-8 data from this source is CESU-8 and decode it manually as I described above (neither MultiByteToWideChar() nor the Delphi RTL will be able to handle it automatically for you), at least until the interface gets fixed, e.g.:

function UTF8DenormalizedToString(s: PAnsiChar): UnicodeString;
var
  utf32: UCS4String;
  len, i: Integer;
begin
  utf32 := ... decode utf8 to utf32 ...; // I leave this as an exercise for you!
  len := Length(utf32) - 1; // UCS4String includes a null terminator
  SetLength(Result, len);
  for i := 1 to len do
    Result[i] := WideChar(utf32[i-1] and $FFFF); // UCS4String is 0-indexed
end;
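The elided decode step can be sketched as follows — a minimal UTF-8-to-UCS4String decoder written for this answer (the name `Utf8ToUCS4` is my own, not an RTL routine). It is deliberately permissive: it performs no validation, which is exactly why it also passes CESU-8's 3-byte surrogate sequences through as codepoints, as the approach above requires. Production code should check lengths and continuation bytes:

```
function Utf8ToUCS4(const s: RawByteString): UCS4String;
var
  i, j, len: Integer;
  b: Byte;
  cp: Cardinal;
begin
  len := Length(s);
  SetLength(Result, len + 1); // worst case: one codepoint per byte, plus terminator
  i := 1;                     // RawByteString is 1-indexed
  j := 0;
  while i <= len do
  begin
    b := Ord(s[i]);
    if b < $80 then
    begin                     // 1-byte sequence (ASCII)
      cp := b;
      Inc(i);
    end
    else if (b and $E0) = $C0 then
    begin                     // 2-byte sequence
      cp := ((b and $1F) shl 6) or (Ord(s[i + 1]) and $3F);
      Inc(i, 2);
    end
    else if (b and $F0) = $E0 then
    begin                     // 3-byte sequence - this is where CESU-8's
                              // surrogate triples (ED A0..BF xx) come through
      cp := ((b and $0F) shl 12) or ((Ord(s[i + 1]) and $3F) shl 6) or
            (Ord(s[i + 2]) and $3F);
      Inc(i, 3);
    end
    else
    begin                     // 4-byte sequence
      cp := ((b and $07) shl 18) or ((Ord(s[i + 1]) and $3F) shl 12) or
            ((Ord(s[i + 2]) and $3F) shl 6) or (Ord(s[i + 3]) and $3F);
      Inc(i, 4);
    end;
    Result[j] := UCS4Char(cp);
    Inc(j);
  end;
  Result[j] := 0;             // UCS4String carries a null terminator
  SetLength(Result, j + 1);
end;
```

For example, it maps `#$ED#$A0#$BD` to $D83D and `#$ED#$B8#$85` to $DE05, so the surrogate test in UTF8DenormalizedToString above can then detect and truncate them.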
Remy Lebeau
  • Thanks for the valuable information. Unfortunately, I can't correct the source that is generating this UTF-8. In the solution shown, I don't yet see how to decode UTF-8 to UCS4. I have only found complicated C solutions that do this conversion according to the UTF-8 notation with complex case distinctions. My first attempt to translate that into Pascal was not working. Is there no ready-made Delphi conversion for this? – Schneider Infosystems Ltd Aug 26 '20 at 07:07
  • @SchneiderInfosystemsLtd UTF-8 is fairly easy to decode manually, there are numerous examples available on StackOverflow (I've posted several myself). I just didn't have time to write it up here yet. Maybe I'll add something tomorrow – Remy Lebeau Aug 26 '20 at 07:23
2
  • UTF-8 consists of 1, 2, 3, or 4 bytes per character. The codepoint U+1F605 is correctly encoded as #$F0#$9F#$98#$85.
  • UTF-16 consists of 2 or 4 bytes per character. The 4 byte sequences are needed to encode codepoints beyond U+FFFF (such as most Emojis). Only UCS-2 is limited to codepoints U+0000 to U+FFFF (this applies to Windows NT versions before 2000).
  • A sequence like #$ED#$A0#$BD#$ED#$B8#$85 (UTF-8-encoded high surrogate, followed by low surrogate) is not valid UTF-8, but CESU-8 - it results from a naive, thus improper translation from UTF-16 to UTF-8: instead of recognizing a 4-byte UTF-16 sequence (encoding one codepoint) and translating it into a single 4-byte UTF-8 sequence, only and always 2 bytes are translated at a time, turning 2×2 bytes into an invalid 6-byte UTF-8 sequence.

Converting your valid UTF-8 sequence #$F0#$9F#$98#$85 into the valid little-endian UTF-16 byte sequence #$3d#$d8#$05#$de works for me. Of course, make sure you use a font which is actually able to render emojis:

// const CP_UTF8= 65001;

function Utf8ToUtf16( const sIn: AnsiString; iSrcCodePage: DWord= CP_UTF8 ): WideString;
var
  iLenDest, iLenSrc: Integer;
begin
  // First calculate how much space is needed
  iLenSrc:= Length( sIn );
  iLenDest:= MultiByteToWideChar( iSrcCodePage, 0, PAnsiChar(sIn), iLenSrc, nil, 0 );

  // Now provide the accurate space
  SetLength( result, iLenDest );
  if iLenDest> 0 then begin  // Otherwise ERROR_INVALID_PARAMETER might occur
    if MultiByteToWideChar( iSrcCodePage, 0, PAnsiChar(sIn), iLenSrc, PWideChar(result), iLenDest )= 0 then begin
      // GetLastError();
      result:= '';
    end;
  end;
end;

...
  Edit1.Font.Name:= 'Segoe UI Symbol';  // Already available in Win7
  Edit1.Text:= Utf8ToUtf16( AnsiString(#$F0#$9F#$98#$85' vs. '#$ED#$A0#$BD#$ED#$B8#$85) );
  // Should display:  😅 vs. ����

To my knowledge Windows neither has a codepage for CESU-8, nor for WTF-8 and as such won't deal with your invalid UTF-8. Also the usage of MB_PRECOMPOSED is discouraged and does not apply to this case anyway.

Talk to whoever gives you invalid UTF-8 and demand that they do their job correctly (or give you the UTF-16 right away). Otherwise you must pre-process the incoming UTF-8 by scanning it for matching surrogate pairs and then replacing those bytes with a proper sequence. Not impossible, not even that difficult, but dull work of patience.

AmigoJack
2

If you have CESU-8 data in a buffer and you need to convert it to UTF-8 you can replace the surrogate pairs with a single UTF-8 encoded char. The rest of the data can be left unchanged.

In this case, your emoji is this:

  • code point: 01 F6 05
  • UTF-8: F0 9F 98 85
  • UTF-16: D8 3D DE 05
  • CESU-8: ED A0 BD ED B8 85

The high surrogate in CESU-8 carries this payload: $003D

And the low surrogate in CESU-8 carries this payload: $0205

As Remy and AmigoJack pointed out you'll find these values when you decode the UTF-16 version of the emoji.

In the case of UTF-16 you apply the same arithmetic: multiply the $003D value by $400 (shl 10), add $0205 to the result, and then add $10000 to get the code point.
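That arithmetic can be made explicit with a tiny helper (the name is mine, not an RTL routine):

```
function SurrogatePayloadsToCodePoint(const aHigh, aLow: Integer): Integer;
begin
  // ($003D shl 10) + $0205 + $10000 = $F400 + $0205 + $10000 = $1F605
  Result := (aHigh shl 10) + aLow + $10000;
end;
```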

Once you have the code point you can convert it to a 4-byte UTF-8 set of values.

function ValidHighSurrogate(const aBuffer: array of AnsiChar; i: integer): boolean;
var
  n: byte;
begin
  Result := False;
  if (ord(aBuffer[i]) <> $ED) then
    exit;

  // High surrogates (U+D800..U+DBFF) encode as ED A0..AF xx,
  // so the second byte's high nibble must be exactly $A
  n := ord(aBuffer[i + 1]) shr 4;
  if (n <> $A) then
    exit;

  // The third byte must be a continuation byte ($80..$BF)
  n := ord(aBuffer[i + 2]) shr 6;
  if (n = $2) then
    Result := True;
end;

function ValidLowSurrogate(const aBuffer: array of AnsiChar; i: integer): boolean;
var
  n: byte;
begin
  Result := False;
  if (ord(aBuffer[i]) <> $ED) then
    exit;

  // Low surrogates (U+DC00..U+DFFF) encode as ED B0..BF xx,
  // so the second byte's high nibble must be exactly $B
  n := ord(aBuffer[i + 1]) shr 4;
  if (n <> $B) then
    exit;

  // The third byte must be a continuation byte ($80..$BF)
  n := ord(aBuffer[i + 2]) shr 6;
  if (n = $2) then
    Result := True;
end;

function GetRawSurrogateValue(const aBuffer: array of AnsiChar; i: integer): integer;
var
  a, b: integer;
begin
  a := ord(aBuffer[i + 1]) and $0F;
  b := ord(aBuffer[i + 2]) and $3F;

  Result := (a shl 6) or b;
end;

function CESU8ToUTF8(const aBuffer: array of AnsiChar): boolean;
var
  TempBuffer: array of AnsiChar;
  i, j, TempLen: integer;
  TempHigh, TempLow, TempCodePoint: integer;
begin
  TempLen := length(aBuffer);
  SetLength(TempBuffer, TempLen);

  i := 0;
  j := 0;
  while (i < TempLen) do
    if (i + 5 < TempLen) and ValidHighSurrogate(aBuffer, i) and
      ValidLowSurrogate(aBuffer, i + 3) then
    begin
      TempHigh := GetRawSurrogateValue(aBuffer, i);
      TempLow := GetRawSurrogateValue(aBuffer, i + 3);
      TempCodePoint := (TempHigh shl 10) + TempLow + $10000;
      TempBuffer[j] := AnsiChar($F0 + ((TempCodePoint and $1C0000) shr 18));
      TempBuffer[j + 1] := AnsiChar($80 + ((TempCodePoint and $3F000) shr 12));
      TempBuffer[j + 2] := AnsiChar($80 + ((TempCodePoint and $FC0) shr 6));
      TempBuffer[j + 3] := AnsiChar($80 + (TempCodePoint and $3F));
      inc(j, 4);
      inc(i, 6);
    end
    else
    begin
      TempBuffer[j] := aBuffer[i];
      inc(i);
      inc(j);
    end;

  SetLength(TempBuffer, j); // trim: each collapsed pair shrinks 6 bytes to 4
  Result := < save the buffer here >;
end;
  • If those `Valid*()` functions look too complicated then https://stackoverflow.com/a/34156887 sums it up nicely. "Correcting" invalid UTF-8 is easy: just search for `#$ed` and then compare the following bytes. – AmigoJack Aug 26 '20 at 20:28