I am trying to understand what a "Unicode string" is, and the more I read the Unicode standard the less I understand it. Let's start from a definition coming from the Unicode standard.
- A Unicode scalar value is any integer between 0x0 and 0xD7FF inclusive, or between 0xE000 and 0x10FFFF inclusive (D76, p. 119).
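To make sure I read D76 correctly, here is the range check I have in mind, as a minimal Rust sketch (I use Rust because its `char` type is documented to hold exactly a Unicode scalar value, so `char::from_u32` should agree with my function):

```rust
// My reading of D76: a Unicode scalar value is any code point outside the
// surrogate range 0xD800..=0xDFFF and not above 0x10FFFF.
fn is_unicode_scalar_value(n: u32) -> bool {
    n <= 0xD7FF || (0xE000..=0x10FFFF).contains(&n)
}

fn main() {
    assert!(is_unicode_scalar_value(0x004D));    // 'M'
    assert!(!is_unicode_scalar_value(0xD800));   // a surrogate, excluded by D76
    assert!(!is_unicode_scalar_value(0x110000)); // beyond the last code point
    assert_eq!(char::from_u32(0xD800), None);    // Rust's char agrees
}
```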
My feeling was that a Unicode string is a sequence of Unicode scalar values. I would define a UTF-8 Unicode string as a sequence of Unicode scalar values encoded in UTF-8.
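In code, the mental model I have is something like this minimal sketch, using Rust `char`s as the scalar values and `encode_utf8` for the encoding step:

```rust
// My working definition: a UTF-8 string = the bytes obtained by encoding a
// sequence of Unicode scalar values (Rust chars) in UTF-8, one after another.
fn main() {
    let scalar_values = ['M', 'é', '€']; // U+004D, U+00E9, U+20AC
    let mut bytes: Vec<u8> = Vec::new();
    let mut buf = [0u8; 4];
    for c in scalar_values {
        bytes.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
    }
    // 'M' -> 4D, 'é' -> C3 A9, '€' -> E2 82 AC
    assert_eq!(bytes, vec![0x4D, 0xC3, 0xA9, 0xE2, 0x82, 0xAC]);
}
```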
But I am not sure that this is the case. Here is one of the many definitions we can see in the standard:
- "Unicode string: A code unit sequence containing code units of a particular Unicode encoding form" (D80, p:120)
But to me this definition is very fuzzy. Just to understand how bad it is, here are a few other "definitions" or strange things in the standard.
- (p. 43) "A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units." According to this definition, any sequence of uint8 would be a valid Unicode 8-bit string. I would rule out this definition, as it would accept anything as a Unicode string!
- (p. 122) "Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form. For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string <004D D800 DF02 004D>, which contains a well-formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is." I would rule out this definition, as it makes it impossible to map a UTF-16-encoded Unicode string to a sequence of Unicode scalar values: it allows a surrogate pair to be split across strings!
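If I try that p. 122 example in Rust (assuming `String::from_utf16` rejects exactly the ill-formed UTF-16 sequences the standard talks about), I see the behaviour it describes: each half is rejected, but the concatenation decodes fine:

```rust
// The p. 122 example: two ill-formed UTF-16 code unit sequences whose
// concatenation is well-formed (D800 DF02 is a surrogate pair for U+10302).
fn main() {
    let a: Vec<u16> = vec![0x004D, 0xD800];
    let b: Vec<u16> = vec![0xDF02, 0x004D];

    assert!(String::from_utf16(&a).is_err()); // unpaired high surrogate
    assert!(String::from_utf16(&b).is_err()); // unpaired low surrogate

    let concatenated = [a, b].concat();
    let s = String::from_utf16(&concatenated).unwrap();
    assert_eq!(s, "M\u{10302}M"); // "M", OLD ITALIC LETTER KE, "M"
}
```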
For a start, let's look for a clear definition of a UTF-8 Unicode string. So far, I can propose 3 definitions (I try to express them in code right after the list), but the real one (if there is one) might be different:
- (1) Any array of uint8
- (2) Any array of uint8 that comes from a sequence of Unicode scalar values encoded in UTF-8
- (3) Any subarray of an array of uint8 that comes from a sequence of Unicode scalar values encoded in UTF-8
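Here is how I would express these in Rust. Definition 1 is just any `&[u8]`, and I am assuming that `std::str::from_utf8` succeeding is exactly the test for definition 2; I have not written a checker for definition 3, but the examples further down show what it would additionally have to accept.

```rust
// Definition 1: any &[u8] qualifies, so there is nothing to check.
// Definition 2: the bytes are the UTF-8 encoding of some sequence of Unicode
// scalar values; as far as I understand, this is what std::str::from_utf8 verifies.
// Definition 3 would additionally accept slices cut in the middle of a multi-byte
// sequence, as long as they are a contiguous part of some definition-2 string.
fn is_utf8_string_def2(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    assert!(is_utf8_string_def2(b"Hello"));       // plain ASCII
    assert!(is_utf8_string_def2(&[0xC3, 0xA9]));  // "é": a complete 2-byte sequence
    assert!(!is_utf8_string_def2(&[0xC3]));       // truncated sequence: fails definition 2
}
```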
To make things concrete, here are a few examples:
- [ 0xFF ] would be a UTF-8 Unicode string according to definition 1, but not according to definitions 2 and 3, as 0xFF can never appear in a sequence of code units that comes from a UTF-8-encoded Unicode scalar value.
- [ 0xB0 ] would be a UTF-8 Unicode string according to definitions 1 and 3, but not according to definition 2, as 0xB0 is a continuation byte of a multi-byte sequence and cannot stand on its own; it satisfies definition 3 because it is a subarray of, e.g., [ 0xC2, 0xB0 ], the UTF-8 encoding of U+00B0 (see the check below).
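Checking those two examples in Rust (again assuming `std::str::from_utf8` corresponds to definition 2):

```rust
fn main() {
    // [0xFF]: accepted only by definition 1. 0xFF can never occur in UTF-8, so it
    // is neither valid UTF-8 (definition 2) nor a slice of valid UTF-8 (definition 3).
    assert!(std::str::from_utf8(&[0xFF]).is_err());

    // [0xB0]: not valid UTF-8 on its own (definition 2), because 0xB0 is a
    // continuation byte and cannot start a sequence...
    assert!(std::str::from_utf8(&[0xB0]).is_err());

    // ...but it is a subarray of [0xC2, 0xB0], the UTF-8 encoding of U+00B0 ("°"),
    // so it passes definition 3.
    let degree_sign = "°".as_bytes();
    assert_eq!(degree_sign, &[0xC2, 0xB0]);
    assert_eq!(&degree_sign[1..], &[0xB0]);
}
```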
I am just lost with this "standard". Do you have any clear definition?