Index character instead of byte in the Delphi string

Question

I am reading the documentation on indexing Delphi strings:

http://docwiki.embarcadero.com/RADStudio/Tokyo/en/String_Types_(Delphi)

One passage says:

You can index a string variable just as you would an array. If S is a non-UnicodeString string variable and i, an integer expression, S[i] represents the ith byte in S, which may not be the ith character or an entire character at all for a multibyte character string (MBCS). Similarly, indexing a UnicodeString variable results in an element that may not be an entire character. If the string contains characters in the Basic Multilingual Plane (BMP), all characters are 2 bytes, so indexing the string gets characters. However, if some characters are not in the BMP, an indexed element may be a surrogate pair - not an entire character.

If I understand correctly, S[i] indexes the i-th byte of the string. If S is a UnicodeString, then S[1] is the first byte of the first character, S[2] is the second byte of the first character, S[3] is the first byte of the second character, and so on. If that is the case, how do I index the characters in a string instead of the bytes? I need to index characters, not bytes.

No, a Unicode "character" in Delphi is two bytes, and if `S` is a `string` (=`UnicodeString` in Delphi 2009 or later), `S[i]` is such a two-byte "character". But only Unicode characters in the BMP can be represented as such a two-byte unit, so `S[i]` might indeed be only one of the two parts in a surrogate pair. — Andreas Rejbrand, Oct 24 '18 at 09:22
(In the vast majority of all applications, you only need the BMP. It contains tens of thousands of characters. I don't know your application, though.) — Andreas Rejbrand, Oct 24 '18 at 09:25
See [Detecting and Retrieving codepoints and surrogates from a Delphi String](https://stackoverflow.com/q/32020126/576719). — LU RD, Oct 24 '18 at 09:26
So in a simple string like "Test ∫⌬dx ᚭᛘᚠ ቚ꡵씒ᱶⵞꮙ៚ㆯ", `S[i]` is the complete character. — Andreas Rejbrand, Oct 24 '18 at 09:33
Please, when adding tags, add the correct one. Do not tag with `delphi-xe2` but with `delphi-xe3` since you actually are using `Delphi XE3`. — Tom Brunberg, Oct 24 '18 at 10:54
You quote this sentence: *If S is a non-UnicodeString string ..., S[i] represents the ith byte in S,...*. Then later you conclude: *If S is a Unicode string, then S[1] is the first byte, S[2] is the 2nd byte of the first character,...*. Do you see the contradiction between those two sentences? — Tom Brunberg, Oct 24 '18 at 10:58
@TomBrunberg, Thank you for your explanation. As for the tag, when I input delphi, I get delphi-xe2, delphi, etc. but no delphi-xe3, so I think delphi-xe3 is a new tag and my reputation is not enough to create a new one. — alancc, Oct 25 '18 at 03:34
@TomBrunberg Then if S is a non-Unicode MBCS string, how do I index characters instead of bytes? — alancc, Oct 25 '18 at 03:35
Re. the `delphi-xe3` tag, did it ever strike you to type more characters until you get the correct choice, or even the complete tag? Your reputation has nothing to do with what tags are shown, and you don't need to create new ones as all delphi tags already exist. — Tom Brunberg, Oct 26 '18 at 12:23

score 4 · Accepted Answer · answered Oct 24 '18 at 09:42

In Delphi, `S[i]` is a `Char`, also known as `WideChar`. But this is not a Unicode "character": it is a 16-bit (2-byte) UTF-16 code unit. In the previous century, i.e. until 1996, Unicode was 16-bit, but that is no longer the case! Please read the Unicode FAQ carefully.

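You may need several `WideChar` elements to make up a whole Unicode code point, which is more or less what we usually call a "character". And even that may be wrong if combining diacritics are used, because a single visible character can then span several code points.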
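To make this concrete, here is a minimal console sketch, assuming Delphi XE2 or later (for the `System.SysUtils` unit scope name). The sample string and the choice of U+1D11E as a non-BMP character are only illustrative. It shows that `Length` and `S[i]` work in 16-bit code units, so the one character outside the BMP occupies two index positions.

```delphi
program SurrogateDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: string;
  I: Integer;
begin
  // 'A', then U+1D11E (outside the BMP) written as its two UTF-16
  // surrogates, then 'B'. Three code points, four code units.
  S := 'A' + #$D834#$DD1E + 'B';

  // Length counts 16-bit code units, not code points.
  Writeln('Length(S) = ', Length(S));

  // Each S[I] is one code unit; S[2] and S[3] are the two halves
  // of the surrogate pair, not characters in their own right.
  for I := 1 to Length(S) do
    Writeln('S[', I, '] = $', IntToHex(Ord(S[I]), 4));
end.
```

This should report a length of 4 and list the units `$0041`, `$D834`, `$DD1E` and `$0042`.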

UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.)

Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.

See the Unicode UTF-16 FAQ, from which the three paragraphs above are quoted.

For proper decoding of Unicode code points in Delphi, see [Detecting and Retrieving codepoints and surrogates from a Delphi String](https://stackoverflow.com/q/32020126/576719) (linked by @LU RD in the comments).
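As a rough illustration of what such decoding involves, the sketch below walks a string code point by code point using the standard UTF-16 surrogate arithmetic. `NextCodePoint` is a hypothetical helper written for this example, not an RTL routine. The `System.Character` unit offers surrogate-related helpers whose names differ between Delphi versions, so the arithmetic is done by hand here. It assumes Delphi XE2 or later and does not handle combining diacritics.

```delphi
program CodePointDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns the code point that starts at S[Index]
// and advances Index past it (by one or two code units).
// An unpaired surrogate is returned unchanged as its own value.
function NextCodePoint(const S: string; var Index: Integer): Integer;
var
  Lead, Trail: Integer;
begin
  Lead := Ord(S[Index]);
  Inc(Index);
  Result := Lead;
  // A high (lead) surrogate must be followed by a low (trail) surrogate.
  if (Lead >= $D800) and (Lead <= $DBFF) and (Index <= Length(S)) then
  begin
    Trail := Ord(S[Index]);
    if (Trail >= $DC00) and (Trail <= $DFFF) then
    begin
      Result := $10000 + ((Lead - $D800) shl 10) + (Trail - $DC00);
      Inc(Index);
    end;
  end;
end;

var
  S: string;
  I, Count, CP: Integer;
begin
  S := 'A' + #$D834#$DD1E + 'B';
  I := 1;
  Count := 0;
  while I <= Length(S) do
  begin
    CP := NextCodePoint(S, I);
    Inc(Count);
    Writeln('U+', IntToHex(CP, 4));
  end;
  Writeln(Count, ' code points in ', Length(S), ' code units');
end.
```

For the sample string this should list U+0041, U+1D11E and U+0042, then report 3 code points in 4 code units. Indexing "the n-th character" then means walking the string this way rather than using `S[n]` directly.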
