Length of a UTF8String in Delphi

Question

How do I get the number of visible characters in a UTF8String, regardless of how many bytes it takes to hold the string?

This is a complex topic. Delphi does not natively provide anything to calculate a visual length, so you would have to do it yourself or, better, use a Unicode library like ICU. Counting Unicode codepoints in a UTF-8 string is fairly easy, but that is not enough, since it may take multiple codepoints to produce 1 visible “character” (grapheme cluster). You have to take into account things like combining codepoints (accents, emojis, etc), no-width codepoints, bi-di modifiers, etc.: things that are not output visually on their own, but can affect the visible output. – Remy Lebeau, Nov 27 '20 at 18:44
[MANAGING UTF-8 STRINGS](https://www.embarcadero.com/images/dm/technical-papers/delphi-and-unicode-marco-cantu.pdf) at pages 15+16. — JosefZ, Nov 27 '20 at 18:49
@JosefZ those pages do not address what the OP is asking for. The `Length` of a `UTF8String` is the number of encoded UTF-8 code units (bytes), not the number of Unicode codepoints, or the number of visual graphemes the codepoints produce. – Remy Lebeau, Nov 27 '20 at 18:51
If `U` is a `UTF8String`, then `Length(string(U))` is an often acceptable approximation to the actual number of "visible characters". In many applications, the number is exactly right. But as Remy points out, there are quite a few exceptions. (Even for an ordinary `string`, `Length(s)` is only an approximation to the number of "visible characters".) — Andreas Rejbrand, Nov 27 '20 at 18:59
You can convert the UTF8String to a UnicodeString and use CharNextW (if you are on Windows) to count chars: https://stackoverflow.com/a/32020629 – zed, Nov 27 '20 at 19:17
`Length(UTF8Decode(utf8String))` gives the number of characters. The so-called visible character count depends on the view. Character sequences like :O are visualized differently in Notepad++ and in some chat views. So you can only define the number of characters. Apps interpret the raw contents differently, and they could give you back the visible content length as they interpret it. – The Bitman, Nov 27 '20 at 19:23
@TheBitman: I'd be more concerned about things like surrogate pairs, combining characters, etc., which affect rendering even in the most plain edit control. — Andreas Rejbrand, Nov 27 '20 at 19:35
@RemyLebeau, 15 years ago I stopped using Delphi, since it didn't support UTF-8 and it was a big pain in the back to try to support it myself. I heard it supports it now, and that there is even a community edition compiler, so I came back to try. It still doesn't support it correctly! When counting characters in a UTF-8 string you don't care how many bytes there are. When you want to remove the last character in `1¢`, `10$`, `100€`, `1.5₧`, you want to remove the character regardless of how many bytes it occupies, and that is still not something you can easily do. I think Delphi needs to learn a few things from C# – AaA, Dec 23 '22 at 05:22
@AaA Delphi switched to native Unicode strings (including UTF-8 strings) just a couple of years after you stopped using it. However, even C# doesn't handle the situation at hand. C# strings are just like Delphi's `UnicodeString` - they are simply a sequence of UTF-16 code units, with no regards to codepoints or grapheme clusters. — Remy Lebeau, Dec 23 '22 at 08:56
What I meant is that at least C# has simple methods to convert from each encoding directly to the internal string and back, e.g. `Encoding.UTF8.GetString()`, with equivalents for ASCII, Base64 ... (UTF-8 data is always in bytes). It's just simple: as long as you know the format of the incoming data, you can convert it to the internal string and back with ease. Unless I'm missing something, with Delphi you need to do `UnicodeString(str)`, and you are always left worrying whether you used the correct string type. Not to mention the many different string formats and definitions that have been added to the language. – AaA, Dec 23 '22 at 09:33
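
To make the distinction drawn in the comments above concrete, here is a minimal sketch that compares three different "lengths" of the same `UTF8String`. The helper `Utf8CodePointCount` is a hypothetical name, not an RTL function. It assumes well-formed UTF-8 and 1-based string indexing, and it counts code points only, so combining sequences and other grapheme clusters are still over-counted.

```delphi
program CountCodePoints;

{$APPTYPE CONSOLE}

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// counting every byte that is not a continuation byte (bit pattern 10xxxxxx).
// Assumes U holds well-formed UTF-8; malformed input is not detected.
function Utf8CodePointCount(const U: UTF8String): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(U) do
    if (Ord(U[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  U: UTF8String;
begin
  // '100' followed by U+20AC EURO SIGN, which takes 3 bytes in UTF-8.
  U := UTF8Encode('100'#$20AC);
  Writeln(Length(U));              // 6: UTF-8 code units (bytes)
  Writeln(Length(string(U)));      // 4: UTF-16 code units
  Writeln(Utf8CodePointCount(U));  // 4: Unicode code points
end.
```

For this input the UTF-16 length and the code point count agree. They diverge once the string contains characters outside the Basic Multilingual Plane (surrogate pairs in UTF-16), and neither matches the visible count when combining marks are present.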

score 0 · Answer 1 · answered Nov 27 '20 at 19:37

If you are on Windows, try `CharNextW`:

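uses
  Winapi.Windows,
  System.SysUtils;

// Counts the "characters" that CharNextW steps over in the UTF-16 form of AStr.
function GetCharsCount(const AStr: UTF8String): Integer;
var
  S: UnicodeString;
  P: PWideChar;
begin
  Result := 0;
  // Keep the converted string in a local so the buffer P points into stays alive.
  S := UnicodeString(AStr);
  P := PWideChar(S);
  // CharNextW works on null-terminated strings, so stop at the #0 terminator.
  // (The original test `P <> ''` does not check the character P points at.)
  while P^ <> #0 do
  begin
    Inc(Result);
    P := CharNextW(P);
  end;
end;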

answered Nov 27 '20 at 19:37

zed


It might be worth pointing out that this approach can also be used for ordinary `string`s, not only `UTF8String`s (indeed, this code begins by converting the latter to the former). – Andreas Rejbrand Nov 27 '20 at 21:28
@DavidHeffernan actually, it does: "*This function works with default "user" expectations of characters when dealing with diacritics. For example: A string that contains U+0061 U+030a "LATIN SMALL LETTER A" + COMBINING RING ABOVE" — which looks like "å", will advance two code points, not one. A string that contains U+0061 U+0301 U+0302 U+0303 U+0304 — which looks like "a´^~¯", will advance five code points, not one, and so on.*" However, `P <> ''` does need to be changed to `P^ <> #0`, at least – Remy Lebeau Nov 28 '20 at 00:35
I think this is fairly close. Take `∫àéôx̅ȧa̎b`, for example. There are arguably 9 visible characters. `Length(S)` returns `17`, while `GetCharsCount` returns `10`. And how often do you use U+200B: ZERO WIDTH SPACE anyway? (Or do you have some other counterexample in mind, @DavidHeffernan?) Also, since the OP is talking about `UTF8String` and not defining "visible characters" very precisely, maybe the OP would even consider `Length(string(U))` to be good enough for his/her needs? – Andreas Rejbrand Nov 28 '20 at 00:38
Interestingly, `GetCharsCount` fails to treat a *leading* surrogate pair as a single character: `b` followed by a surrogate pair is said to be 2 chars, but a surrogate pair followed by `b` is said to be 3 chars. Isn't that a bit strange? – Andreas Rejbrand Nov 28 '20 at 01:15
Sorry, comment deleted. My mistake. Answer would benefit from an explanation of all this though. Or perhaps just made a dupe of remy's linked answer. – David Heffernan Nov 28 '20 at 07:34
That `while P <> ''` (or the alternative `while P^ <> #0`) will stop at a null character somewhere _in_ the string, won't it? How should I count when my strings contain such null characters? For example, serialized PHP objects have these. – Anse Sep 17 '22 at 11:29
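
One possible way to handle the embedded null characters raised in the last comment is to bound the loop by the string's length instead of its terminator. The sketch below is an assumption-laden variant of the answer's function, not part of the original answer: `GetCharsCountBounded` is a hypothetical name, each embedded `#0` is counted as one character, and the code relies on the `UTF8String` to `UnicodeString` conversion preserving embedded nulls, which is worth verifying on your Delphi version.

```delphi
program CountWithNulls;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows;

// Hypothetical variant of GetCharsCount that does not stop at embedded #0.
function GetCharsCountBounded(const AStr: UTF8String): Integer;
var
  S: UnicodeString;
  P, PEnd: PWideChar;
begin
  Result := 0;
  S := UnicodeString(AStr);
  P := PWideChar(S);
  PEnd := P + Length(S);     // one past the last UTF-16 code unit
  while P < PEnd do
  begin
    Inc(Result);
    if P^ = #0 then
      Inc(P)                 // CharNextW does not advance past a null, so step manually
    else
      P := CharNextW(P);     // stops at the next #0 at the latest, so it cannot pass PEnd
  end;
end;

begin
  // 'ab', an embedded null character, then 'cd'.
  Writeln(GetCharsCountBounded(UTF8Encode('ab'#0'cd')));
end.
```

The same caveats as for the original function apply: the result reflects whatever `CharNextW` treats as a character, which is only an approximation of the visible count.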
