A Unicode string can contain surrogate pairs (especially for emoticons). I need to truncate such a string to n chars. How can I do this safely, without splitting a surrogate pair and so breaking an emoticon?

zeus
  • You probably mean "surrogate pairs", not "multi-byte characters". Recall that `sizeof(char) = 2` since Delphi 2009. – Andreas Rejbrand Jun 18 '18 at 20:23
  • @AndreasRejbrand : yes, I mean surrogate pairs :) I updated the question! – zeus Jun 18 '18 at 20:24
  • See [Detecting and Retrieving codepoints and surrogates from a Delphi String](https://stackoverflow.com/q/32020126/576719) – LU RD Jun 18 '18 at 20:29
  • @LURD : I was hoping that something easier was already built into Delphi, as truncating a string is quite a common operation :( – zeus Jun 18 '18 at 20:33
  • Well, in the RTL there are [System.Character.IsSurrogate](http://docwiki.embarcadero.com/Libraries/en/System.Character.IsSurrogate) and [System.Character.IsSurrogatePair](http://docwiki.embarcadero.com/Libraries/en/System.Character.IsSurrogatePair), plus the example [CharacterSurrogates (Delphi)](http://docwiki.embarcadero.com/CodeExamples/en/CharacterSurrogates_(Delphi)) – LU RD Jun 18 '18 at 20:42
  • What's wrong with [LeftStr](http://docwiki.embarcadero.com/Libraries/en/System.StrUtils.LeftStr), for example? It operates on characters (as do e.g. [Copy](http://docwiki.embarcadero.com/Libraries/en/System.Copy) and [Delete](http://docwiki.embarcadero.com/Libraries/en/System.Delete)), not on bytes. – Victoria Jun 18 '18 at 20:45
  • @Victoria : LeftStr uses Copy internally and I don't think Copy takes care of surrogates :( – zeus Jun 18 '18 at 20:53
  • What about characters that are composed from multiple code points? But for your question you can walk through the string counting, accounting for surrogate pairs. Not difficult. Not sure there is anything built in. – David Heffernan Jun 18 '18 at 21:02
  • @Victoria: Indeed, if it had been that simple, we could simply have used `SetLength` to truncate the string. – Andreas Rejbrand Jun 18 '18 at 21:03
  • @DavidHeffernan so far I haven't found anything built in :( I asked to be sure ... – zeus Jun 18 '18 at 21:08
  • @Andreas, you're right there. So then EMBT has all the related documentation wrong, because for a human being (sometimes even a developer :) a surrogate pair is still a character. – Victoria Jun 18 '18 at 21:08
  • @Victoria : especially now that emoticons are very common ... you are about as likely (maybe just a little less) to find a surrogate pair in a UTF-16 string as a multi-byte sequence in a UTF-8 string ... – zeus Jun 18 '18 at 21:10
  • @Victoria Documentation is fine. Even a code point isn't one to one mapped to a glyph. Unicode is much more complex. UTF16 strings are just arrays of 16 bit code units. – David Heffernan Jun 18 '18 at 21:11
  • @David, no, it's not. A surrogate pair is still a character (or a symbol, so to say). And if the documentation of those functions does not say _"we do not account for surrogate pairs"_, which the functions do not, then there is something wrong. – Victoria Jun 18 '18 at 21:15
  • @Victoria No. Surrogate pair just encodes a code point. But that need not be a character. Characters can be composed from multiple code points. Now some documentation may be out of date, and I don't think they document it all, but it's also not Emba's task to redocument all the nuances of UTF16. – David Heffernan Jun 18 '18 at 21:17
  • @Victoria Start here http://docwiki.embarcadero.com/RADStudio/Tokyo/en/Unicode_in_RAD_Studio – David Heffernan Jun 18 '18 at 21:23
  • @David, I see. Thanks for the links, but to me a surrogate pair is still a _character_. How it's represented (even if it's a complete transcription of some novel) is not important to me (as a human being). – Victoria Jun 18 '18 at 21:33
  • @Victoria the correct term is code point. Character is a little imprecise. I'm no expert mind you. – David Heffernan Jun 18 '18 at 21:41
  • @David, yes, it is imprecise. But even the extended ASCII code table contains graphemes like e.g. "æ", and almost everyone would refer to such a thing as a "character". I would be fine if EMBT considered adding warnings to the documentation of their string manipulation functions that surrogate pairs are not supported, if that is the case. – Victoria Jun 18 '18 at 22:01
  • @Victoria To my mind there's not much point discussing anything if we don't use the official terminology. If you want to work with UTF16 then you need to understand it. – David Heffernan Jun 18 '18 at 22:07
  • @David, one last question then, what would you choose as a title for this post? How to truncate a string to n chars considering surrogate pair code points? If so, then you'd be speaking about such pair result as a char ;-) – Victoria Jun 18 '18 at 22:27
  • @Victoria It's one thing for the asker to be imprecise about the terminology. But the answerers don't have such leeway. They are supposed to be experts. – David Heffernan Jun 19 '18 at 05:46
  • @David: "What about characters that are composed from multiple code points?" Remy's answer in the link LURD posted has all that: surrogates and combining marks. And *graphemes*. – Rudy Velthuis Jun 19 '18 at 13:54
  • @Rudy Yes. My question was rhetorical. – David Heffernan Jun 19 '18 at 13:59

1 Answer

The following code should be able to solve your issue:

FUNCTION IsDiacritical(C : CHAR) : BOOLEAN;
  VAR
    W   : WORD ABSOLUTE C;

  BEGIN
    Result:=((W>=$1AB0) AND (W<=$1AFF)) OR
            ((W>=$0300) AND (W<=$036F)) OR
            ((W>=$1DC0) AND (W<=$1DFF))
  END;

FUNCTION GetNextChar(VAR S : STRING) : STRING;
  VAR
    P   : Integer;

  BEGIN
    // P counts the UTF-16 code units that make up the next unit of S
    // (IsSurrogatePair comes from System.Character)
    IF S='' THEN
      P:=0
    ELSE IF (LENGTH(S)>=2) AND IsSurrogatePair(S,1) THEN
      P:=2
    ELSE
      P:=1;
    // Combining marks follow their base character, so keep them attached to it
    WHILE (P<LENGTH(S)) AND IsDiacritical(S[P+1]) DO INC(P);
    // Return the unit and remove it from the front of S
    Result:=COPY(S,1,P);
    DELETE(S,1,P)
  END;


FUNCTION GetStringByCodePoints(S : STRING ; CodePoints : Cardinal) : STRING;
  VAR
    I   : Cardinal;

  BEGIN
    Result:='';
    FOR I:=1 TO CodePoints DO Result:=Result+GetNextChar(S)
  END;

PROCEDURE SetLengthByCodePoints(VAR S : STRING ; CodePoints : Cardinal);
  BEGIN
    SetLength(S,GetStringByCodePoints(S,CodePoints).Length)
  END;

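`GetStringByCodePoints` is analogous to `COPY` from the start of the string, and `SetLengthByCodePoints` is analogous to `SetLength`. Both, however, take a number of code points ("visible characters" or control characters, each together with any attached combining diacritical marks) instead of a number of UTF-16 code units (`Char` values).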
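
For comparison, the "walk through the string counting" idea from the comments can be sketched as a single pass that never modifies the input. This is only a sketch under stated assumptions: `TruncateToCodePoints` is a hypothetical name (not an RTL routine), string indexing is 1-based, the surrogate ranges are tested directly instead of calling `IsSurrogatePair`, and combining marks are deliberately ignored, so each one counts as its own code point.

```delphi
// Hypothetical helper, not part of the RTL: returns the leading part of S
// holding at most MaxCodePoints code points, never splitting a surrogate pair.
// UnicodeString is what "string" means in Delphi 2009 and later.
function TruncateToCodePoints(const S: UnicodeString;
  MaxCodePoints: Integer): UnicodeString;
var
  I, Count: Integer;
begin
  I := 1;      // 1-based index of the next unread UTF-16 code unit
  Count := 0;  // code points accepted so far
  while (I <= Length(S)) and (Count < MaxCodePoints) do
  begin
    // A high surrogate ($D800..$DBFF) directly followed by a low surrogate
    // ($DC00..$DFFF) encodes one code point in two code units.
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Count);
  end;
  Result := Copy(S, 1, I - 1);
end;
```

With `S := 'a'#$D83D#$DE00'b'` (U+1F600 stored as a surrogate pair between two letters), a limit of 2 keeps three code units, so the pair stays whole, and a limit larger than the string simply returns all of it.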

If there are more blocks of combining diacritical code points, `IsDiacritical` can be extended to include them. The three ranges I check for (U+0300–U+036F, U+1AB0–U+1AFF and U+1DC0–U+1DFF) are the ones I could find by a simple Google search.
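
As one possible extension, the check could also cover the Unicode blocks "Combining Diacritical Marks for Symbols" (U+20D0–U+20FF) and "Combining Half Marks" (U+FE20–U+FE2F). The sketch below assumes a hypothetical name, `IsCombiningMark`, and is still far from a full grapheme-cluster test: combining marks in other scripts, variation selectors and emoji joined with U+200D (zero width joiner) are not covered.

```delphi
// Hypothetical variant of IsDiacritical that covers five combining-mark
// blocks of the Basic Multilingual Plane. WideChar is Char in Delphi 2009+.
function IsCombiningMark(C: WideChar): Boolean;
begin
  case Ord(C) of
    $0300..$036F,   // Combining Diacritical Marks
    $1AB0..$1AFF,   // Combining Diacritical Marks Extended
    $1DC0..$1DFF,   // Combining Diacritical Marks Supplement
    $20D0..$20FF,   // Combining Diacritical Marks for Symbols
    $FE20..$FE2F:   // Combining Half Marks
      Result := True;
  else
    Result := False;
  end;
end;
```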

HeartWare