
I am trying to understand what a "Unicode string" is, and the more I read the Unicode standard the less I understand it. Let's start with a definition from the Unicode standard.

  • A Unicode scalar value is any integer between 0x0 and 0xD7FF inclusive, or between 0xE000 and 0x10FFFF inclusive (D76, p. 119)

My feeling was that a Unicode string is a sequence of Unicode scalar values. I would define a UTF-8 Unicode string as a sequence of Unicode scalar values encoded in UTF-8. But I am not sure that this is the case. Here is one of the many definitions we can see in the standard.

  • "Unicode string: A code unit sequence containing code units of a particular Unicode encoding form" (D80, p:120)

But to me this definition is very fuzzy. Just to understand how bad it is, here are a few other "definitions" or strange things in the standard.

  • (p. 43) "A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units." According to this definition, any sequence of uint8 is a valid UTF-8 Unicode string. I would rule out this definition as it would accept anything as a Unicode string!!!

  • (p. 122) "Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form. For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string <004D D800 DF02 004D>, which contains a well-formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is." I would rule out this definition, as it would make it impossible to define a sequence of Unicode scalar values for a Unicode string encoded in UTF-16: this definition allows surrogate pairs to be cut!!! (I check the concatenation from this quote in the sketch right after this list.)
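Just to convince myself that the p. 122 example is real, here is a quick sketch in Python (I am using Python's utf-16-be codec as a stand-in for "a UTF-16 decoder"; that choice is my own assumption, not something taken from the standard):

```python
import struct

def utf16be_bytes(code_units):
    """Pack a list of 16-bit code units into big-endian bytes."""
    return b"".join(struct.pack(">H", u) for u in code_units)

s1 = [0x004D, 0xD800]  # ill-formed: ends with an unpaired high surrogate
s2 = [0xDF02, 0x004D]  # ill-formed: starts with an unpaired low surrogate

for s in (s1, s2):
    try:
        utf16be_bytes(s).decode("utf-16-be")
    except UnicodeDecodeError as e:
        print("ill-formed:", e.reason)

# The concatenation is well-formed: D800 DF02 is a valid surrogate pair (U+10302).
print(utf16be_bytes(s1 + s2).decode("utf-16-be"))  # 'M\U00010302M'
```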

For a start, let's look for a clear definition of a UTF-8 Unicode string. So far, I can propose 3 definitions, but the real one (if there is one) might be different:

  • (1) Any array of uint8
  • (2) Any array of uint8 that comes from a sequence of Unicode scalar values encoded in UTF-8
  • (3) Any subarray of an array of uint8 that comes from a sequence of Unicode scalar values encoded in UTF-8 (a sketch of what I mean by "encoded in UTF-8" follows this list)
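To be explicit about what I mean in (2) and (3) by "comes from a sequence of Unicode scalar values encoded in UTF-8", here is a sketch in Python (the choice of Python and of these particular scalar values is mine, purely for illustration):

```python
# A sequence of Unicode scalar values (code points outside the surrogate range).
scalars = [0x004D, 0x10302]  # 'M' and U+10302 (OLD ITALIC LETTER KE)

# Encode each scalar value with UTF-8 and concatenate the resulting code units.
utf8_units = b"".join(chr(s).encode("utf-8") for s in scalars)
print([hex(b) for b in utf8_units])  # ['0x4d', '0xf0', '0x90', '0x8c', '0x82']
```

Definition (2) would accept exactly the arrays produced this way; definition (3) would also accept any contiguous slice of such an array.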

To make things concrete, here are a few examples:

  • [ 0xFF ] would be a UTF-8 Unicode string according to definition 1, but not according to definitions 2 and 3, as 0xFF can never appear in a sequence of code units that comes from UTF-8-encoded Unicode scalar values.
  • [ 0xB0 ] would be a UTF-8 Unicode string according to definitions 1 and 3, but not according to definition 2, as 0xB0 can only appear as a continuation byte inside a multi-byte sequence (for instance as the second byte of [ 0xC2, 0xB0 ], which encodes U+00B0). Both examples are checked in the sketch below.
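Here is a small check of both examples in Python (using Python's strict utf-8 codec as my stand-in for "well-formed UTF-8", which is essentially definition 2; definition 3 is the one that would additionally accept [ 0xB0 ]):

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 code unit sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_well_formed_utf8(bytes([0xFF])))        # False: 0xFF never occurs in UTF-8
print(is_well_formed_utf8(bytes([0xB0])))        # False: lone continuation byte
print(is_well_formed_utf8(bytes([0xC2, 0xB0])))  # True: U+00B0 '°'; 0xB0 is its second byte
```

Definition 3 would still accept [ 0xB0 ], because it is a subarray of [ 0xC2, 0xB0 ], even though the strict check above rejects it.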

I am just lost with this "standard". Do you have any clear definition?

InsideLoop
  • This has been done to death on here, use the search function there are some very good answers on other similar questions. – Matt Jun 17 '17 at 17:59
  • @Matt. I have searched, and I did not find anything. Moreover, if there is a clear definition, I want a quote from the standard. If you have one, I'll be glad if you can share it. Or at least give an answer to the last examples: are `[ 0xFF ]` and `[ 0xB0 ]` valid `UTF-8 unicode strings`? – InsideLoop Jun 17 '17 at 18:02
  • Well, OK, to be fair you might not get your strict definition from the spec in the other answers, rather a good explanation, e.g. https://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16. However, I think the first definition you've posted is the answer, and it doesn't contradict the other two, because a Unicode string (at this level of abstraction) doesn't guarantee the validity of its component parts. – Matt Jun 17 '17 at 18:13
  • @Matt. I already know a few things about Unicode. I have a good feeling for what a code unit, a scalar value, a code point, and a grapheme are. I also know the problems of normalization, which make the notion of equality difficult in Unicode. I also know how UTF-8 and UTF-16 encode a scalar value. But I am really looking for a formal definition (such as what is given in mathematics, for instance) of a UTF-8 Unicode string. The first definition contradicts the 2 others, as [ 0xFF ] is not valid according to the 2 others but is valid according to the first definition. – InsideLoop Jun 17 '17 at 18:21

1 Answer


My feeling was that a Unicode string is a sequence of Unicode scalar values.

No, a Unicode string is a sequence of code units. The standard doesn't contain "many definitions", but only a single one:

D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.

This doesn't require the string to be well-formed (see the following definitions). None of your other quotes from the standard contradict this definition. To the contrary, they only illustrate that a Unicode string, as defined by the standard, can be ill-formed.

An application shall only create well-formed strings, of course:

If a Unicode string purports to be in a Unicode encoding form, then it must not contain any ill-formed code unit subsequence.

But the standard also contains some sections on how to deal with ill-formed input sequences.
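For example, a common strategy the standard discusses is to substitute U+FFFD REPLACEMENT CHARACTER for each ill-formed subsequence when decoding. A rough illustration in Python (the errors="replace" handler follows that substitution approach; exactly how many U+FFFD characters a decoder emits per ill-formed subsequence is an implementation detail):

```python
# A byte sequence that is not well-formed UTF-8: 'M', a stray 0xFF byte, then 'N'.
ill_formed = bytes([0x4D, 0xFF, 0x4E])

# A strict decoder rejects the whole sequence ...
try:
    ill_formed.decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

# ... while a lenient decoder replaces the ill-formed subsequence with U+FFFD.
print(ill_formed.decode("utf-8", errors="replace"))  # 'M\ufffdN'
```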

nwellnhof