
I am trying to better understand surrogate pairs and Unicode implementation in Delphi.

If I call Length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I will get back 8.

This is because the lengths of the individual characters [Ĥ],[à̲],[V̂], and [e] are 2, 3, 2, and 1 respectively. This is because Ĥ has a surrogate, à̲ has two additional surrogates, V̂ has a surrogate and e has no surrogates.

If I wanted to return the second element in the string including all surrogates, [à̲], how would I do that? I know I would need to do some sort of testing of the individual bytes. I ran some tests using the routine

function GetFirstCodepointSize(const S: UTF8String): Integer;  

referenced in this SO Question.

but got some unusual results; e.g., here are the lengths and sizes of some different codepoints. Below is a snippet of how I generated these tables.

...
UTFCRUDResultStrings.add('INPUT: '+#9#9+ DATA +#9#9+ 'GetFirstCodePointSize = ' +intToStr(GetFirstCodepointSize(DATA))
+#9#9+ 'Length =' + intToStr(length(DATA)));
...

First Set: This makes sense to me, each code point size is doubled, but these are one character each and Delphi gives me the length as just 1, perfect.

INPUT:      ď       GetFirstCodePointSize = 2       Length =1
INPUT:      ơ       GetFirstCodePointSize = 2       Length =1
INPUT:      ǥ       GetFirstCodePointSize = 2       Length =1

Second set: It initially looks to me like the lengths and code points are reversed? I am guessing the reason for this is that the characters + surrogates are being treated individually, hence the first codepoint size is for the 'H', which is 1, but the length is returning the lengths of 'H' plus '^'.

INPUT:      Ĥ      GetFirstCodePointSize = 1       Length =2
INPUT:      à̲     GetFirstCodePointSize = 1       Length =3
INPUT:      V̂      GetFirstCodePointSize = 1       Length =2
INPUT:      e       GetFirstCodePointSize = 1       Length =1

Some additional tests...

INPUT:      ¼       GetFirstCodePointSize = 2       Length =1
INPUT:      ₧       GetFirstCodePointSize = 3       Length =1
INPUT:            GetFirstCodePointSize = 4       Length =2
INPUT:      ß       GetFirstCodePointSize = 2       Length =1
INPUT:            GetFirstCodePointSize = 4       Length =2

Is there a reliable way in Delphi to determine where an element in a Unicode String starts and ends?

I know my terminology using the word element may be off, but I don't think codepoint and character are right either, particularly given that one element may have a codepoint size of 3, but have a length of only one.

sse
  • *Could someone implement the following function?* This is not a code writing service, where you post your requirements and someone churns out the code to meet them. Make your best effort to write it yourself. If you run into difficulty, post the code you've written, explain how it doesn't work as you expect, and ask a **specific question** about that code, and we can try to help you. *Please give me the code* isn't a valid question here. – Ken White Aug 15 '15 at 01:11

2 Answers


I am trying to better understand surrogate pairs and Unicode implementation in Delphi.

Let's get some terminology out of the way.

Each "character" (known as a grapheme) that is defined by Unicode is assigned a unique codepoint.

In a Unicode Transformation Format (UTF) encoding - UTF-7, UTF-8, UTF-16, and UTF-32 - each codepoint is encoded as a sequence of codeunits. The size of each codeunit is determined by the encoding - 7 bits for UTF-7, 8 bits for UTF-8, 16 bits for UTF-16, and 32 bits for UTF-32 (hence their names).
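
For illustration, a rough console sketch (untested here) showing how many codeunits the same codepoint needs in each encoding, using TEncoding.GetByteCount() from SysUtils:

uses
  SysUtils;

var
  S: String;
begin
  S := 'Ĥ'; // U+0124, a single codepoint
  Writeln(TEncoding.UTF8.GetByteCount(S));    // 2 bytes -> two 8bit UTF-8 codeunits
  Writeln(TEncoding.Unicode.GetByteCount(S)); // 2 bytes -> one 16bit UTF-16 codeunit
end;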

In Delphi 2009 and later, String is an alias for UnicodeString, and Char is an alias for WideChar. WideChar is 16 bits. A UnicodeString holds a UTF-16 encoded string (in earlier versions of Delphi, the equivalent string type was WideString), and each WideChar is a UTF-16 codeunit.

In UTF-16, a codepoint can be encoded using either 1 or 2 codeunits. 1 codeunit can encode codepoint values in the Basic Multilingual Plane (BMP) range - $0000 to $FFFF, inclusive. Higher codepoints require 2 codeunits, known as a surrogate pair.
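
For example, a codepoint outside the BMP occupies two WideChar elements. A rough sketch (assuming a console app, and that the ConvertToUtf32() wrapper from the Character unit is available in your Delphi version):

uses
  SysUtils, Character;

var
  S: String;
begin
  S := #$D835#$DC00;                                    // U+1D400 as a UTF-16 surrogate pair
  Writeln(Length(S));                                   // 2 - two WideChar codeunits
  Writeln(IntToHex(Ord(S[1]), 4));                      // D835 - high surrogate
  Writeln(IntToHex(Ord(S[2]), 4));                      // DC00 - low surrogate
  Writeln(IntToHex(Integer(ConvertToUtf32(S, 1)), 5));  // 1D400 - the decoded codepoint
end;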

If I call Length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I will get back 8.

This is because the lengths of the individual characters [Ĥ],[à̲],[V̂], and [e] are 2, 3, 2, and 1 respectively.

This is because Ĥ has a surrogate, à̲ has two additional surrogates, V̂ has a surrogate and e has no surrogates.

Yes, there are 8 WideChar elements (codeunits) in your UTF-16 UnicodeString. What you are calling "surrogates" are actually known as "combining marks". Each combining mark is its own unique codepoint, and thus its own codeunit sequence.
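
To see that for yourself, a rough console sketch (assuming the second "character" really is 'a' followed by U+0300 and U+0332) that dumps each WideChar and its Unicode category:

uses
  SysUtils, Character;

var
  S: String;
  I: Integer;
begin
  S := 'à̲'; // assumed: 'a' + U+0300 COMBINING GRAVE ACCENT + U+0332 COMBINING LOW LINE
  for I := 1 to Length(S) do
    Writeln(IntToHex(Ord(S[I]), 4), ' category=', Integer(GetUnicodeCategory(S[I])));
  // the two marks report the non-spacing mark category (Mn)
end;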

If I wanted to return the second element in the string including all surrogates, [à̲], how would I do that?

You have to start at the beginning of the UnicodeString and analyze each WideChar until you find one that is not a combining mark attached to a previous WideChar. On Windows, the easiest way to do that is to use the CharNextW() function, eg:

uses
  Windows; // for CharNext()

var
  S: String;
  P: PChar;
begin
  S := 'Ĥà̲V̂e';
  P := CharNext(PChar(S)); // returns a pointer to à̲
end;
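
Building on that, two consecutive CharNext() calls bracket the second element, and SetString() can copy just that span (a rough sketch, untested):

uses
  Windows;

var
  S, Second: String;
  pStart, pEnd: PChar;
begin
  S := 'Ĥà̲V̂e';
  pStart := CharNext(PChar(S)); // start of the 2nd element
  pEnd := CharNext(pStart);     // start of the 3rd element
  {$POINTERMATH ON}
  SetString(Second, pStart, pEnd - pStart); // Second = 'à̲' (3 WideChars)
end;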

The Delphi RTL does not have an equivalent function. You would have to write one manually, or use a third-party library. The RTL does have a StrNextChar() function, but it only handles UTF-16 surrogates, not combining marks (CharNext() handles both). So, you could use StrNextChar() to scan through each codepoint in the UnicodeString, but you would have to look at each codepoint to know whether it is a combining mark or not, e.g.:

uses
  SysUtils, Character; // StrNextChar() is in SysUtils, GetUnicodeCategory() in Character

function MyCharNext(P: PChar): PChar;
begin
  if (P <> nil) and (P^ <> #0) then
  begin
    // step over the next codepoint (1 or 2 codeunits)...
    Result := StrNextChar(P);
    // ...and then over any combining marks (categories Mn, Mc, Me) attached to it
    while GetUnicodeCategory(Result^) in [TUnicodeCategory.ucNonSpacingMark,
      TUnicodeCategory.ucCombiningMark, TUnicodeCategory.ucEnclosingMark] do
      Result := StrNextChar(Result);
  end else begin
    Result := nil;
  end;
end;

var
  S: String;
  P: PChar;
begin
  S := 'Ĥà̲V̂e';
  P := MyCharNext(PChar(S)); // should return a pointer to  à̲
end;
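
The same helper can also walk the whole string; a rough sketch (using the MyCharNext() above) that prints each element on its own line:

var
  S, Element: String;
  pStart, pEnd: PChar;
begin
  S := 'Ĥà̲V̂e';
  pStart := PChar(S);
  while pStart^ <> #0 do
  begin
    pEnd := MyCharNext(pStart);
    {$POINTERMATH ON}
    SetString(Element, pStart, pEnd - pStart);
    Writeln(Element); // Ĥ, à̲, V̂, e - one element per line
    pStart := pEnd;
  end;
end;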

I know I would need to do some sort of testing of the individual bytes.

Not the bytes, but the codepoints that they represent when decoded.

I ran some tests using the routine

function GetFirstCodepointSize(const S: UTF8String): Integer

Look closely at that function signature. See the parameter type? It is a UTF-8 string, not a UTF-16 string. This was even stated in the answer you got that function from:

Here is an example how to parse UTF8 string

UTF-8 and UTF-16 are very different encodings, and thus have different semantics. You cannot use UTF-8 semantics to process a UTF-16 string, and vice versa.
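
That is also why the sizes in your tables do not line up with Length(): passing a UnicodeString into that UTF8String parameter triggers an implicit UTF-16 to UTF-8 conversion, so GetFirstCodepointSize() counts the 8bit codeunits of the first codepoint only, while Length() counts all of the 16bit codeunits in the string. A rough sketch:

var
  S: String;
  U8: UTF8String;
begin
  S := '₧';            // U+20A7, a single BMP codepoint
  U8 := UTF8String(S); // implicit UTF-16 -> UTF-8 conversion, as happens when a
                       // UnicodeString is passed to a UTF8String parameter
  Writeln(Length(S));  // 1 - one 16bit UTF-16 codeunit
  Writeln(Length(U8)); // 3 - three 8bit UTF-8 codeunits
end;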

Is there a reliable way in Delphi to determine where an element in a Unicode String starts and ends?

Not directly. You have to parse the string from the beginning, skipping elements as needed until you reach the desired element. Remember that each codepoint may be encoded as either 1 or 2 codeunit elements, and each logical glyph may be encoded using multiple codepoints (and thus multiple codeunit sequences).

I know my terminology using the word element may be off, but I don't think codepoint and character are right either, particularly given that one element may have a codepoint size of 3, but have a length of only one.

1 glyph is comprised of 1+ codepoints, and each codepoint is encoded as 1+ codeunits.

Could someone implement the following function?

function GetElementAtIndex(S: String; StrIdx : Integer): String;

Try something like this:

uses
  SysUtils, Character;

function MyCharNext(P: PChar): PChar;
begin
  Result := P;
  if Result <> nil then
  begin
    // step over the next codepoint (1 or 2 codeunits)...
    Result := StrNextChar(Result);
    // ...and then over any combining marks (categories Mn, Mc, Me) attached to it
    while GetUnicodeCategory(Result^) in [TUnicodeCategory.ucNonSpacingMark,
      TUnicodeCategory.ucCombiningMark, TUnicodeCategory.ucEnclosingMark] do
      Result := StrNextChar(Result);
  end;
end;

function GetElementAtIndex(S: String; StrIdx : Integer): String;
var
  pStart, pEnd: PChar;
begin
  Result := '';
  if (S = '') or (StrIdx < 1) then Exit; // StrIdx is 1-based
  pStart := PChar(S);
  // skip over the elements before the requested one
  while StrIdx > 1 do
  begin
    pStart := MyCharNext(pStart);
    if pStart^ = #0 then Exit;
    Dec(StrIdx);
  end;
  pEnd := MyCharNext(pStart);
  {$POINTERMATH ON}
  SetString(Result, pStart, pEnd-pStart);
end;
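
A quick check of the expected behaviour (assuming the sample string from the question):

var
  S: String;
begin
  S := 'Ĥà̲V̂e';
  Writeln(GetElementAtIndex(S, 2)); // à̲ - 3 WideChars: 'a' plus its two combining marks
end;
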
Remy Lebeau
  • thank you for all of the detail. This also makes clear that indexing a utf16 string, eg., S [i] will not always work as expected, given that the char itself may or may not have combining marks and may not fit into a widechar. Thank you for helping me understand this better. – sse Aug 16 '15 at 06:02
  • I do believe that an automatic conversion occurs from utf16 to utf8 in the function getFirstCodePointSize. I will try to find a reference. Thanks again. – sse Aug 16 '15 at 06:06
  • Yes, there is an automatic conversion when assigning one string type to another. `UTF8String` and `UnicodeString` are separate string types. `getFirstCodePointSize()` takes a `UTF8String` as input, so it is going to return information related to UTF-8, not UTF-16. In this case, it returns the number of 8bit codeunits used to encode the first codepoint in the UTF-8 string. UTF-8 encodes a codepoint using either 1, 2, 3, or 4 8bit codeunits. As I said earlier, UTF-16 encodes a codepoint using 1 or 2 16bit codeunits. That is why I said you cannot use UTF-8 semantics to process a UTF-16 string. – Remy Lebeau Aug 16 '15 at 06:24
  • One other takeaway, that I hope is true. Is that I will get the total number of bytes in a UTF16 string, if I multiply its length by SizeOf(Char), eg., totalBytes = Length(S)*SizeOf(Char), will always give me the exact number of bytes in the UTF16 String, regardless of whether or not there are surrogate pairs or Combining Marks and even if the character is NOT on the BMP. I wonder because code abounds that indicates we can get the number of bytes in a UTF16 string simply by multiplying its length by size of WideChar. I just want to be sure this is always true. Thank you again. :) – sse Aug 17 '15 at 18:53
  • Yes, `Length(S)*SizeOf(Char)` is the total byte count of a `String`. For D2009+, `String=UnicodeString` and `Char=WideChar`. The RTL has a `ByteLength()` function in the `SysUtils` unit that performs that calculation for you. You can use a similar calculation for `UTF8String` (or any other `AnsiString-based` string type) by multiplying the `Length()` by `SizeOf(AnsiChar)` (ie: 1) instead. – Remy Lebeau Aug 17 '15 at 23:34

Looping through the graphemes of a string can be more complicated than you might think. In Unicode 13, some graphemes are up to 14 bytes long. I advise using a third-party library for this. One of the best for this is Skia4Delphi: https://github.com/skia4delphi/skia4delphi

The code is very simple:

  // assumes the Skia4Delphi units are installed (named Skia or System.Skia,
  // depending on the version) and Vcl.Dialogs for ShowMessage()
  var LUnicode: ISkUnicode := TSkUnicode.Create;
  for var LGrapheme: string in LUnicode.GetBreaks('Text', TSkBreakType.Graphemes) do
    ShowMessage(LGrapheme);
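
And if you only need one specific element, the same GetBreaks() call can simply be enumerated until the desired grapheme is reached (a rough sketch using only the API shown above):

  var LUnicode: ISkUnicode := TSkUnicode.Create;
  var LIndex := 0;
  for var LGrapheme: string in LUnicode.GetBreaks('Ĥà̲V̂e', TSkBreakType.Graphemes) do
  begin
    Inc(LIndex);
    if LIndex = 2 then
    begin
      ShowMessage(LGrapheme); // à̲ - the second grapheme
      Break;
    end;
  end;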

The library's own demo also includes an example of a grapheme iterator.


vfbb