
I am trying to understand what a "Unicode string" is, and the more I read the Unicode standard the less I understand it. Let's start with a definition from the Unicode standard.

  • A Unicode scalar value is any integer between 0x0 and 0xD7FF inclusive, or between 0xE000 and 0x10FFFF inclusive (D76, p. 119)

My feeling was that a Unicode string is a sequence of Unicode scalar values. I would define a UTF-8 Unicode string as a sequence of Unicode scalar values encoded in UTF-8. But I am not sure that this is the case. Here is one of the many definitions we can see in the standard.

  • "Unicode string: A code unit sequence containing code units of a particular Unicode encoding form" (D80, p:120)

But to me this definition is very fuzzy. Just to understand how bad it is, here are a few other "definitions" or strange things in the standard.

  • (p. 43) "A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units." According to this definition, any sequence of uint8 is a valid UTF-8 Unicode string. I would rule out this definition as it would accept anything as a Unicode string!!!

  • (p. 122) "Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form. For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string <004D D800 DF02 004D>, which contains a well-formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is." I would rule out this definition, as it would make it impossible to define a sequence of Unicode scalar values for a Unicode string encoded in UTF-16: this definition allows surrogate pairs to be cut!!! (I check the concatenation from this quote in the sketch right after this list.)
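Just to convince myself that the p. 122 example is real, here is a quick sketch in Python (I am using Python's utf-16-be codec as a stand-in for "a UTF-16 decoder"; that choice is my own assumption, not something taken from the standard):

```python
import struct

def utf16be_bytes(code_units):
    """Pack a list of 16-bit code units into big-endian bytes."""
    return b"".join(struct.pack(">H", u) for u in code_units)

s1 = [0x004D, 0xD800]  # ill-formed: ends with an unpaired high surrogate
s2 = [0xDF02, 0x004D]  # ill-formed: starts with an unpaired low surrogate

for s in (s1, s2):
    try:
        utf16be_bytes(s).decode("utf-16-be")
    except UnicodeDecodeError as e:
        print("ill-formed:", e.reason)

# The concatenation is well-formed: D800 DF02 is a valid surrogate pair (U+10302).
print(utf16be_bytes(s1 + s2).decode("utf-16-be"))  # 'M\U00010302M'
```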

For a start, let's look for a clear definition of a UTF-8 Unicode string. So far, I can propose 3 definitions, but the real one (if there is one) might be different:

  • (1) Any array of uint8
  • (2) Any array of uint8 that comes from a sequence of Unicode scalar values encoded in UTF-8
  • (3) Any subarray of an array of uint8 that comes from a sequence of Unicode scalar values encoded in UTF-8 (a sketch of what I mean by "encoded in UTF-8" follows this list)
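To be explicit about what I mean in (2) and (3) by "comes from a sequence of Unicode scalar values encoded in UTF-8", here is a sketch in Python (the choice of Python and of these particular scalar values is mine, purely for illustration):

```python
# A sequence of Unicode scalar values (code points outside the surrogate range).
scalars = [0x004D, 0x10302]  # 'M' and U+10302 (OLD ITALIC LETTER KE)

# Encode each scalar value with UTF-8 and concatenate the resulting code units.
utf8_units = b"".join(chr(s).encode("utf-8") for s in scalars)
print([hex(b) for b in utf8_units])  # ['0x4d', '0xf0', '0x90', '0x8c', '0x82']
```

Definition (2) would accept exactly the arrays produced this way; definition (3) would also accept any contiguous slice of such an array.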

To make things concrete, here are a few examples:

  • [ 0xFF ] would be a UTF-8 Unicode string according to definition 1, but not according to definitions 2 and 3, as 0xFF can never appear in a sequence of code units that comes from UTF-8-encoded Unicode scalar values.
  • [ 0xB0 ] would be a UTF-8 Unicode string according to definitions 1 and 3, but not according to definition 2, as 0xB0 can only appear as a continuation byte inside a multi-byte sequence (for instance as the second byte of [ 0xC2, 0xB0 ], which encodes U+00B0). Both examples are checked in the sketch below.
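Here is a small check of both examples in Python (using Python's strict utf-8 codec as my stand-in for "well-formed UTF-8", which is essentially definition 2; definition 3 is the one that would additionally accept [ 0xB0 ]):

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 code unit sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_well_formed_utf8(bytes([0xFF])))        # False: 0xFF never occurs in UTF-8
print(is_well_formed_utf8(bytes([0xB0])))        # False: lone continuation byte
print(is_well_formed_utf8(bytes([0xC2, 0xB0])))  # True: U+00B0 '°'; 0xB0 is its second byte
```

Definition 3 would still accept [ 0xB0 ], because it is a subarray of [ 0xC2, 0xB0 ], even though the strict check above rejects it.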

I am just lost with this "standard". Do you have any clear definition?

InsideLoop
  • This has been done to death on here, use the search function there are some very good answers on other similar questions. – Matt Jun 17 '17 at 17:59
  • @Matt. I have searched, and I did not find anything. Moreover, if there is a clear definition, I want a quote from the standard. If you have one, I'll be glad if you can share it. Or at least give an answer to the last examples: are `[ 0xFF ]` and `[ 0xB0 ]` valid `UTF-8 unicode strings`? – InsideLoop Jun 17 '17 at 18:02
  • Well, OK, to be fair you might not get your strict definition from the spec in the other answers, rather a good explanation, e.g. https://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16. However, I think the first definition you've posted is the answer, and it doesn't contradict the other two, because a Unicode string (at this level of abstraction) doesn't guarantee the validity of its component parts. – Matt Jun 17 '17 at 18:13
  • @Matt. I already know a few things about Unicode. I have a good feeling for what a code unit, a scalar value, a code point, and a grapheme are. I also know the problems of normalization, which make the notion of equality difficult in Unicode. I also know how UTF-8 and UTF-16 encode a scalar value. But I am really looking for a formal definition (such as what is given in mathematics, for instance) of a UTF-8 Unicode string. The first definition contradicts the 2 others, as [ 0xFF ] is not valid according to the 2 others but is valid according to the first definition. – InsideLoop Jun 17 '17 at 18:21

1 Answer


My feeling was that a Unicode string is a sequence of Unicode scalar values.

No, a Unicode string is a sequence of code units. The standard doesn't contain "many definitions", but only a single one:

D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.

This doesn't require the string to be well-formed (see the following definitions). None of your other quotes from the standard contradict this definition. To the contrary, they only illustrate that a Unicode string, as defined by the standard, can be ill-formed.

An application shall only create well-formed strings, of course:

If a Unicode string purports to be in a Unicode encoding form, then it must not contain any ill-formed code unit subsequence.

But the standard also contains some sections on how to deal with ill-formed input sequences.
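For example, a common strategy the standard discusses is to substitute U+FFFD REPLACEMENT CHARACTER for each ill-formed subsequence when decoding. A rough illustration in Python (the errors="replace" handler follows that substitution approach; exactly how many U+FFFD characters a decoder emits per ill-formed subsequence is an implementation detail):

```python
# A byte sequence that is not well-formed UTF-8: 'M', a stray 0xFF byte, then 'N'.
ill_formed = bytes([0x4D, 0xFF, 0x4E])

# A strict decoder rejects the whole sequence ...
try:
    ill_formed.decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

# ... while a lenient decoder replaces the ill-formed subsequence with U+FFFD.
print(ill_formed.decode("utf-8", errors="replace"))  # 'M\ufffdN'
```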

nwellnhof