How to manipulate substrings, and not subarrays, of UnicodeString?

Question

I am testing migration from Delphi 5 to XE. Being unfamiliar with UnicodeString, before asking my question I would like to present its background.

The Delphi XE string functions Copy, Delete and Insert take a parameter Index that tells where the operation should start. Index may be any integer from 1 to the length of the string the function is applied to. Since a string can contain multi-element characters, the operation can start at an element (e.g. a surrogate) that belongs to a multi-element series encoding a single Unicode named code-point. So, starting from a sensible string and applying one of these functions, we can obtain a nonsensical result.

The phenomenon can be illustrated with the cases below, which apply the function Copy to strings representing the same array of named code-points (i.e. meaningful signs):

  ($61, $13000, $63)

It is the concatenation of 'a', EGYPTIAN HIEROGLYPH A001 (U+13000) and 'c'.

Case 1. Copy of AnsiString (element = byte)

We start with the above mentioned UnicodeString #$61#$13000#$63 and we convert it to UTF-8 encoded AnsiString s0.

Then we test the function

  copy (s0, index, 1)

for all possible values of index; there are 6 of them since s0 is 6 bytes long.

    procedure Copy_Utf8Test;
    type TAnsiStringUtf8 = type AnsiString (CP_UTF8);
    var ss    : string;
        s0,s1 : TAnsiStringUtf8;
        ii    : integer;
    begin
      ss := #$61#$13000#$63; //mem dump of ss: $61 $00 $0C $D8 $00 $DC $63 $00
      s0 := ss;              //mem dump of s0: $61 $F0 $93 $80 $80 $63
      ii := length(s0);      //sets ii=6 (bytes)
      s1 := copy(s0,1,1);    //'a'
      s1 := copy(s0,2,1);    //#$F0  F means "start of 4-byte series"; no corresponding named code-point
      s1 := copy(s0,3,1);    //#$93  "trailing in multi-byte series"; no corresponding named code-point
      s1 := copy(s0,4,1);    //#$80  "trailing in multi-byte series"; no corresponding named code-point
      s1 := copy(s0,5,1);    //#$80  "trailing in multi-byte series"; no corresponding named code-point
      s1 := copy(s0,6,1);    //'c'
    end;

The first and last results are sensible within UTF-8 codepage, while the other 4 are not.

Case 2. Copy of UnicodeString (element = word)

We start with the same UnicodeString s0 := #$61#$13000#$63.

Then we test the function

  copy (s0, index, 1)

for all possible values of index; there are 4 of them since s0 is 4 words long.

    procedure Copy_Utf16Test;
    var s0,s1 : string;
        ii    : integer;
    begin
      s0 := #$61#$13000#$63; //mem dump of s0: $61 $00 $0C $D8 $00 $DC $63 $00
      ii := length(s0);      //sets ii=4 (words)
      s1 := copy(s0,1,1);    //'a'
      s1 := copy(s0,2,1);    //#$D80C surrogate pair member; no corresponding named code-point
      s1 := copy(s0,3,1);    //#$DC00 surrogate pair member; no corresponding named code-point
      s1 := copy(s0,4,1);    //'c'
    end;

The first and last results are sensible within codepage CP_UNICODE (1200), while the other 2 are not.

Conclusion.

The string functions Copy, Delete and Insert work perfectly on a string considered as a mere array of bytes or words. But they are not helpful when the string is seen as what it essentially is, i.e. a representation of an array of named code-points.

Both cases above deal with strings that represent the same array of 3 named code-points. They are representations (encodings) of the same text composed of 3 meaningful signs (to avoid abusing the term "characters").

One may want to be able to extract (copy) any of those meaningful signs regardless of whether a particular text representation (encoding) is a mono- or multi-element one. I have spent quite some time looking for a satisfactory equivalent of the Copy that I was used to in Delphi 5.

Question. Do such equivalents exist, or do I have to write them myself?

I suppose what I'm getting at is that surrogate pairs are only the tip of the iceberg. For example, what about composition. Consider `'e'#$0301` — David Heffernan, Sep 10 '14 at 08:22
@David. I suppose that you ask me to reveal heuristics underlying my question. If, according to you, my supposition is correct, then I would like to invite you to read once more the section **Conclusion** of my message. Believe me, I did my best - it took me about 10 hours - to make my message as comprehensive as I could, and **Conclusion** was the ultimate result of that effort. Of course, if you have a more specific question, I would willingly try to provide appropriate and more specific answer. — jkomorowski, Sep 10 '14 at 18:20
I'm trying to work out what you are looking for. I wonder how many meaningful signs you believe `'e'#$0301` to be. I suspect that you understand about surrogates but are not yet aware of the further nuances of Unicode. Even with UTF-32 where a character element corresponds directly to a code point, it can take multiple code points to define a single grapheme. Hence my raising the issue of composition. Do you understand me? — David Heffernan, Sep 10 '14 at 18:23
@David. Concerning `'e'#$0301`. Since I am trying *to learn* and *not to teach*, I can only share my outlook on the subject. If I have two UnicodeStrings: `s0:='e'#$0301` and `s1:=#$00E9` then not only they are 2 different strings, but also they encode 2 different arrays of named codepoints (_meaningful signs_): `("E LATIN SMALL LETTER", "ACCENT, COMBINING ACUTE")` and `("E WITH ACUTE, LATIN SMALL LETTER")`. Meanwhile, on the human language level of abstraction, both `s0` and `s1` are (semantically) equivalent; they represent (encode) the same letter of French alphabet. — jkomorowski, Sep 10 '14 at 18:27
I try to expose three levels of abstraction: (1) **encoding** level (where Delphi XE strings belong), (2) **meaningful signs** level (where I would like, in a sense, to do the **Copy**, **Delete** and **Insert** operations), and (3) **semantics** level (where you brought me with your question about diacriticals). — jkomorowski, Sep 10 '14 at 18:28
The problem I have with this is that "meaningful signs" is your terminology and I don't know what it means. With Unicode it pays to use the standard terminology. Perhaps by "meaningful sign" you mean "code point". However, in order to offer advice, it would help to know why you want to call functions like `Copy`, `Insert` etc. In my experience, in my programming, the issues that you raise never actually arise. I've never encountered a scenario where I've split a string in the middle of a surrogate pair. Can you provide an example where that might happen? — David Heffernan, Sep 10 '14 at 18:32
@David. Please allow me an *off topic* remark. I have read several of your answers to other SO members and I would like to compliment you on your patience and courtesy. From my point of view, it's a noble complement to your technical skills. – jkomorowski, Sep 10 '14 at 18:53
Er thanks. I think there are a lot of people here who wouldn't describe me as courteous, probably correctly! You must have lucked out and got me on my good days! — David Heffernan, Sep 10 '14 at 18:55
@David. As I tried to indicate (not define formally), by "meaningful sign" I mean a codepoint endowed with a unicode name (i.e. named codepoint, also my jargon ;-) ). — jkomorowski, Sep 10 '14 at 19:03
All code points are named. Not all potentially valid code points have been named yet. But the standard gets new code points all the time. To which standard are you working? I still don't understand the motivation. Perhaps you just wish to learn, which is good. – David Heffernan, Sep 10 '14 at 19:07
Regarding your comment to remy's answer, I see no real reason for you to write the functions you propose to write. I cannot see what use they would have. — David Heffernan, Sep 10 '14 at 19:10
The point is that, in practice, you don't find yourself extracting arbitrary substrings from arbitrary substrings. — David Heffernan, Sep 10 '14 at 19:16
@David. Concerning unnamed codepoints: www.unicode.org/charts/PDF/UD800.pdf reads: "Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range." – jkomorowski, Sep 10 '14 at 19:28
@David. You are right, first of all, I try to learn. For the moment I am not convinced that one never needs to extract the 5th "meaningful sign" from a string. I would be extremely happy to learn that I am wrong. – jkomorowski, Sep 10 '14 at 19:35
It depends on the type of programming you do. I've never needed that. If I were you I'd wait until you need this before coding it. — David Heffernan, Sep 10 '14 at 19:37
@David. It's very kind of you to spare me useless effort, thanks. Since you and Remy assert the same (in a sense), there should be an explanation to that. I have a vague idea what it could be but I prefer to think about it in details first. — jkomorowski, Sep 10 '14 at 20:15
Think when you would use Copy. Generally there would be a call to Pos to first locate the start index. But that would guarantee that you would not be splitting surrogates. You might get bitten by non-canonical representations, e.g. composition. Normalisation would deal with that. — David Heffernan, Sep 10 '14 at 20:23

score 4 · Accepted Answer · answered Sep 10 '14 at 02:24

What you have described is how Copy(), Delete(), and Insert() have ALWAYS worked, even for AnsiString. The functions operate on elements (ie codeunits in Unicode terminology), and always have.

AnsiString is a string of 8bit AnsiChar elements, which can be encoded in any 8bit ANSI/MBCS format, including UTF-8.

UnicodeString (and WideString) is a string of 16bit WideChar elements, which are encoded in UTF-16.

The functions HAVE NEVER taken encoding into account. Not for MBCS AnsiString. Not for UTF-16 UnicodeString. Indexes are absolute element indexes from the beginning of the string.

If you need encoding-aware Copy/Delete/Insert functions that operate on logical codepoint boundaries, where each codepoint may be 1+ elements in the string, then you have to write your own functions, or find third-party functions that do what you need. There are no MBCS/UTF-aware mutator functions in the RTL.
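
As a rough illustration of what "write your own" could look like for UTF-16, here is a minimal sketch of a Copy whose Index and Count are measured in code points rather than elements. The names `ElementsAt` and `CopyCodePoints` are hypothetical (they are not RTL functions), the surrogate checks are written by hand so the sketch depends on no extra unit, and an unpaired surrogate is simply treated as one code point. It does not address combining sequences such as `'e'#$0301`.

```pascal
program CodePointCopyDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// All helper names below are hypothetical; none of them exist in the RTL.

function IsHighSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

function IsLowSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// Number of UTF-16 elements (1 or 2) used by the code point that starts at
// element I. An unpaired surrogate is counted as a single element.
function ElementsAt(const S: UnicodeString; I: Integer): Integer;
begin
  if IsHighSurrogate(S[I]) and (I < Length(S)) and IsLowSurrogate(S[I + 1]) then
    Result := 2
  else
    Result := 1;
end;

// Like Copy, but Index (1-based) and Count are measured in code points.
function CopyCodePoints(const S: UnicodeString; Index, Count: Integer): UnicodeString;
var
  I, CP, First: Integer;
begin
  Result := '';
  if (Index < 1) or (Count < 1) then
    Exit;
  I := 1;
  CP := 1;
  // Skip the Index - 1 code points that precede the requested start.
  while (I <= Length(S)) and (CP < Index) do
  begin
    Inc(I, ElementsAt(S, I));
    Inc(CP);
  end;
  First := I;
  // Advance over at most Count code points.
  CP := 0;
  while (I <= Length(S)) and (CP < Count) do
  begin
    Inc(I, ElementsAt(S, I));
    Inc(CP);
  end;
  // The element-based Copy is safe here: both ends sit on code point boundaries.
  Result := Copy(S, First, I - First);
end;

var
  S: UnicodeString;

begin
  S := 'a'#$D80C#$DC00'c';                    // 'a', U+13000, 'c' (4 elements)
  Writeln(Length(CopyCodePoints(S, 2, 1)));   // 2: the whole surrogate pair
  Writeln(Length(CopyCodePoints(S, 1, 2)));   // 3: 'a' plus the pair
  Writeln(CopyCodePoints(S, 3, 1) = 'c');     // TRUE
end.
```

Delete and Insert variants could follow the same pattern: translate the code point index into an element index first, then call the existing element-based routine.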

answered Sep 10 '14 at 02:24

Remy Lebeau


Thank you very much. Your last paragraph is a very satisfactory answer to my question. I am surprised that Delphi designers decided to put aside the "encoding-aware", as you say, aspects of string manipulation. You reassured me that if I develop _my_ equivalents of **Copy**, **Delete** and **Insert** I am not going to do something which is already and probably better done in Delphi. – jkomorowski Sep 10 '14 at 18:34
Until D2009, `AnsiString` did not have a codepage associated with it, so there was no way for such functions to know how an `AnsiString` was encoded (and assuming the OS default is not flexible enough) and thus how many elements any given codepoint actually occupied in it. With codepage-aware `AnsiString` and UTF-16 `UnicodeString` both introduced in D2009, it is possible to create codepage-aware functions, but the existing functions were already locked in and could not be re-designed without breaking years worth of legacy code. – Remy Lebeau Sep 10 '14 at 19:30
They could have made new functions, but most people don't need that level of functionality for most string tasks, which is why it does not exist in the RTL. People that do can simply write their own as needed. – Remy Lebeau Sep 10 '14 at 19:32

score 2 · Answer 2 · answered Sep 10 '14 at 06:13

You should parse the Unicode string yourself. Fortunately, the Unicode encodings are designed to make parsing easy. Here is an example of how to parse a UTF-8 string:

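    program Project9;

    {$APPTYPE CONSOLE}

    uses
      SysUtils;

    // Returns the number of bytes in the first UTF-8 encoded code point of S,
    // 0 for an empty string, or -1 if S does not start with a valid lead byte.
    function GetFirstCodepointSize(const S: UTF8String): Integer;
    var
      B: Byte;
    begin
      if Length(S) = 0 then
      begin
        Result := 0;
        Exit;
      end;
      B := Byte(S[1]);
      if (B and $80) = 0 then
        Result := 1        // 0xxxxxxx: single byte (ASCII)
      else if (B and $E0) = $C0 then
        Result := 2        // 110xxxxx: lead byte of a 2-byte sequence
      else if (B and $F0) = $E0 then
        Result := 3        // 1110xxxx: lead byte of a 3-byte sequence
      else if (B and $F8) = $F0 then
        Result := 4        // 11110xxx: lead byte of a 4-byte sequence
      else
        Result := -1;      // continuation byte or invalid lead byte
    end;

    var
      S: string;

    begin
      // The explicit cast converts the UTF-16 string to UTF-8 before parsing.
      S := #$61#$13000#$63;
      Writeln(GetFirstCodepointSize(UTF8String(S)));  // 1
      S := #$13000#$63;
      Writeln(GetFirstCodepointSize(UTF8String(S)));  // 4
      S := #$63;
      Writeln(GetFirstCodepointSize(UTF8String(S)));  // 1
      Readln;
    end.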
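
If the data is already a UnicodeString, the same idea can be applied to UTF-16 directly, without transcoding. The following is only a sketch: the function name `GetFirstCodepointSizeUtf16` and its return conventions (0 for an empty string, -1 for an unpaired surrogate) are assumptions made for this illustration, not anything the RTL provides.

```pascal
program Utf16FirstCodePoint;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical UTF-16 counterpart of GetFirstCodepointSize.
// Returns the number of 16-bit elements in the first code point of S:
// 1 for a BMP character, 2 for a valid surrogate pair,
// 0 for an empty string, -1 for an unpaired or misordered surrogate.
function GetFirstCodepointSizeUtf16(const S: UnicodeString): Integer;
var
  W: Word;
begin
  if Length(S) = 0 then
  begin
    Result := 0;
    Exit;
  end;
  W := Ord(S[1]);
  if (W < $D800) or (W > $DFFF) then
    Result := 1          // not a surrogate
  else if (W <= $DBFF) and (Length(S) >= 2) and
          (Ord(S[2]) >= $DC00) and (Ord(S[2]) <= $DFFF) then
    Result := 2          // high surrogate followed by a low surrogate
  else
    Result := -1;        // lone or misordered surrogate
end;

begin
  Writeln(GetFirstCodepointSizeUtf16('a'#$D80C#$DC00'c'));  // 1
  Writeln(GetFirstCodepointSizeUtf16(#$D80C#$DC00'c'));     // 2
  Writeln(GetFirstCodepointSizeUtf16(#$DC00'c'));           // -1
end.
```

Like the UTF-8 version, this only finds code point boundaries; a grapheme built from several code points still counts as several units.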

answered Sep 10 '14 at 06:13

kludg


I can't quite see what to do with this information. Yes you avoid surrogates. But they are simple to detect as well. If the original data is encoded in UTF-16, a variable length encoding, why switch to UTF-8, another variable length encoding? And you've also not dealt with composition. Consider this string `'e'#$0301`. – David Heffernan Sep 10 '14 at 08:21
@DavidHeffernan I don't understand why you write such comments. If you prefer to parse UTF16 do it, I never did and prolly will never do it. Composite Unicode graphemes consist of two or more codepoints. Your `'e'#$0301` consists of 2 codepoints. If you have questions about Unicode you'd better ask them as questions. Good luck. – kludg Sep 10 '14 at 09:44
I raise composition because I suspect that the asker is not aware of it, and should be. You could usefully mention that in the answer. If you are going to advocate transcoding then again I think you should say why you prefer transcoding. When I have questions about Unicode that I cannot answer myself, I'll certainly be sure to ask questions. – David Heffernan Sep 10 '14 at 10:26
