I am trying to better understand surrogate pairs and Unicode implementation in Delphi.
If I call Length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I get back 8.
This is because the individual characters [Ĥ], [à̲], [V̂], and [e] have lengths 2, 3, 2, and 1 respectively: Ĥ has one surrogate, à̲ has two additional surrogates, V̂ has one surrogate, and e has none.
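To make the count of 8 concrete, here is a minimal sketch that dumps each UTF-16 code unit of the string. It assumes the string is stored in decomposed form (base letter followed by combining marks), which I have spelled out with explicit character codes so the editor cannot change it. The program name is just for illustration. As far as I can tell, the extra units it prints (U+0302, U+0300, U+0332) are combining marks rather than true surrogates, since surrogates would fall in the range U+D800..U+DFFF.

```delphi
program DumpCodeUnits;
{$APPTYPE CONSOLE}

uses
  SysUtils; // System.SysUtils on compilers that require unit scope names

var
  S: string;
  I: Integer;
begin
  // Assumed decomposed form:
  // H + U+0302, a + U+0300 + U+0332, V + U+0302, e
  S := 'H'#$0302'a'#$0300#$0332'V'#$0302'e';

  // Length counts UTF-16 code units, not visible characters.
  Writeln('Length = ', Length(S)); // prints Length = 8

  // Print every code unit so the extra units can be inspected.
  for I := 1 to Length(S) do
    Writeln(I, ': U+', IntToHex(Ord(S[I]), 4));
end.
```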
If I wanted to return the second element in the string including all surrogates, [à̲], how would I do that? I know I would need to do some sort of testing of the individual bytes. I ran some tests using the routine
function GetFirstCodepointSize(const S: UTF8String): Integer;
referenced in this SO question, but got some unusual results. Here are the lengths and sizes for some different code points. Below is a snippet showing how I generated these tables.
...
UTFCRUDResultStrings.Add('INPUT: ' + #9#9 + DATA + #9#9 + 'GetFirstCodePointSize = ' + IntToStr(GetFirstCodepointSize(DATA))
  + #9#9 + 'Length =' + IntToStr(Length(DATA)));
...
First set: This makes sense to me. Each code point size is doubled, but these are one character each and Delphi gives me the length as just 1, perfect.
INPUT: ď GetFirstCodePointSize = 2 Length =1
INPUT: ơ GetFirstCodePointSize = 2 Length =1
INPUT: ǥ GetFirstCodePointSize = 2 Length =1
Second set: At first glance, the lengths and code point sizes look reversed. My guess is that the character and its surrogates are being treated individually: the first code point size is for the 'H', which is 1, while Length returns the count for 'H' plus '^'.
INPUT: Ĥ GetFirstCodePointSize = 1 Length =2
INPUT: à̲ GetFirstCodePointSize = 1 Length =3
INPUT: V̂ GetFirstCodePointSize = 1 Length =2
INPUT: e GetFirstCodePointSize = 1 Length =1
Some additional tests...
INPUT: ¼ GetFirstCodePointSize = 2 Length =1
INPUT: ₧ GetFirstCodePointSize = 3 Length =1
INPUT: GetFirstCodePointSize = 4 Length =2
INPUT: ß GetFirstCodePointSize = 2 Length =1
INPUT: GetFirstCodePointSize = 4 Length =2
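One possible explanation for these numbers, as far as I can tell: GetFirstCodepointSize takes a UTF8String, so it presumably reports UTF-8 bytes, whereas Length on a string reports UTF-16 code units. The two are different units, so they would not be expected to match. A minimal sketch to check this, where the Report procedure is a hypothetical helper of my own and the sample characters are given as explicit codes:

```delphi
program CompareSizes;
{$APPTYPE CONSOLE}

// Hypothetical helper: prints the size of S in both encodings.
procedure Report(const Name, S: string);
var
  U: UTF8String;
begin
  U := UTF8Encode(S); // convert the UTF-16 string to UTF-8 bytes
  Writeln(Name, ': UTF-16 code units = ', Length(S),
    ', UTF-8 bytes = ', Length(U));
end;

begin
  Report('U+010F', #$010F);        // 1 code unit, 2 bytes
  Report('U+20A7', #$20A7);        // 1 code unit, 3 bytes
  // A code point above U+FFFF, written as a UTF-16 surrogate pair.
  Report('U+1F600', #$D83D#$DE00);
end.
```

If that reading is right, a size of 4 with a length of 2 would be a single code point above U+FFFF, which is the case where a real surrogate pair appears.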
Is there a reliable way in Delphi to determine where an element in a Unicode String starts and ends?
I know my use of the word "element" may be off, but I don't think "code point" or "character" is right either, particularly given that one element may have a code point size of 3 but a length of only 1.
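For illustration, here is a rough sketch of the kind of routine I have in mind. It treats an element as one code point (keeping a UTF-16 surrogate pair together) followed by any combining marks. All function names are hypothetical, and IsCombining is a deliberate simplification that only recognises the Combining Diacritical Marks block (U+0300..U+036F), so it is not a full implementation of Unicode text segmentation. The RTL's System.Character unit may offer more complete classification.

```delphi
program SplitElements;
{$APPTYPE CONSOLE}

// True for a UTF-16 high (lead) surrogate.
function IsHighSurr(C: Char): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

// True for a UTF-16 low (trail) surrogate.
function IsLowSurr(C: Char): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// Simplification: only the Combining Diacritical Marks block.
function IsCombining(C: Char): Boolean;
begin
  Result := (Ord(C) >= $0300) and (Ord(C) <= $036F);
end;

// Number of UTF-16 code units in the element starting at Index (1-based).
function ElementLength(const S: string; Index: Integer): Integer;
var
  I: Integer;
begin
  Result := 0;
  if (Index < 1) or (Index > Length(S)) then
    Exit;
  I := Index + 1;
  // Keep a surrogate pair together as one code point.
  if IsHighSurr(S[Index]) and (I <= Length(S)) and IsLowSurr(S[I]) then
    Inc(I);
  // Absorb any combining marks that follow the base code point.
  while (I <= Length(S)) and IsCombining(S[I]) do
    Inc(I);
  Result := I - Index;
end;

// Returns the N-th element (1-based), or '' if there is none.
function NthElement(const S: string; N: Integer): string;
var
  P, Len: Integer;
begin
  Result := '';
  P := 1;
  while P <= Length(S) do
  begin
    Len := ElementLength(S, P);
    Dec(N);
    if N = 0 then
    begin
      Result := Copy(S, P, Len);
      Exit;
    end;
    Inc(P, Len);
  end;
end;

var
  S, E: string;
begin
  // Assumed decomposed form of the example string.
  S := 'H'#$0302'a'#$0300#$0332'V'#$0302'e';
  E := NthElement(S, 2);  // a + U+0300 + U+0332
  Writeln(Length(E));     // prints 3
end.
```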