
I recently read the UTF-8 Everywhere manifesto, a document arguing for handling text with UTF-8 by default. The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.

However, some modern languages that use the UTF-8 default have built-in codepoint types, such as rune in Go and char in Rust.

What are these types actually useful for? Are they legacy from times before the meaninglessness of codepoints was broadly understood? Or is that an incomplete perspective?

TylerH
acupoftea
  • "that Unicode codepoints aren't a generally useful concept" is wrong as a general statement. Yes, text processing _is_ hard and any naive dealing with "characters" leads to ugly problems but that doesn't mean that programming languages should not provide _support_ for relevant types. Numeric code is complicated too but most languages provide floating point types. – Volker Sep 09 '22 at 14:13
  • How would you write «programs/libraries specializing in text processing» without dealing with codepoints anyway? ;-) Another point: conceptually, a string is a sequence of codepoints irrespective of its actual encoding. So, when you _extract_ a single codepoint out of it, or _iterate_ over the codepoints encoded in such a string,—where/how are you supposed to keep these codepoints? – kostix Sep 09 '22 at 14:14
  • I also think that "the meaninglessness of codepoints was broadly understood" is a harsh exaggeration. Just because most likely a lot of code will work on uninterpreted UTF-8 streams only doesn't mean that codepoints are somehow meaningless. – Volker Sep 09 '22 at 14:15
  • Also note that the most prominent Go compiler, gc, is written in Go, and the language specification requires Go source code to be UTF-8 encoded, but it _has_ to break this UTF-8 stream down into codepoints to _parse_ the source code. So if codepoints were "meaningless" and runes and chars useless legacy, you could not write a compiler in such a language (in any sensible manner). – Volker Sep 09 '22 at 14:17
  • @Volker, I've glanced over the referenced document, and I think the OP refers to its section 8.3 "Counting coded characters or code points is important." which contrasts dealing with codepoints to dealing with grapheme clusters—in view of visual text manipulation. In that sense, the OP's question makes sense but the answer would require like ⅓ of text of that source document ;-) – kostix Sep 09 '22 at 14:22
  • @kostix if they were needed only in these specialized situations I wouldn't expect to find them as built-in types with 4 letter names - thus the dissonance. A possible answer to the question ("what are they for") would talk about iterating over codepoints and when that's useful. – acupoftea Sep 09 '22 at 14:26
  • @Volker remember that the quoted statements aren't my opinion, but the impression I got from internet writings. If they are incorrect that is a possible answer to the question. – acupoftea Sep 09 '22 at 14:32
  • Thanks, that indeed makes possible answers less handwavey. Here's an opinionated answer: the referenced doc is a sort of grab-bag of different bits and pieces about Unicode and feels like a whirlwind tour through its wonderful world. Maybe that was precisely intended. The section I mentioned demonstrates how hard it is to deal with complex writing systems properly in different contexts _for specific tasks._ Still, a sequence of bytes which is _any_ encoding of Unicode, UTF-8 included, _is a sequence of code points_ — simply because that's how Unicode encodings are defined. … – kostix Sep 09 '22 at 14:36
  • @kostix I don't know the culture on this site well, could you explain the reasoning for why this is off-topic? I think I've seen questions about unicode, utf8 and uses/usefulness of language features that were upvoted and not closed. The question seems to fit under "a practical, answerable problem that is unique to software development". Practical answers could include "yes these types are legacy from dark times, don't use them unless you're sure" and "these types are generally useful for a b c". – acupoftea Sep 09 '22 at 14:36
  • …Iterating over those code-points is a natural low-level task, and PLs provide means for it. More complex stuff is usually handled separately. For Go, look at [`golang.org/x/text`](https://pkg.go.dev/golang.org/x/text) hierarchy. C and C++ code often uses the ICU library which, itself, is the size of a typical compiler of a programming language (or larger) ;-) – kostix Sep 09 '22 at 14:38
  • «could you explain the reasoning for why this is off-topic»—yes, because it was tagged `go` but did not actually present any particular problem with programming in Go. I think the question could be OK as is on softwareengineering.stackexchange.com or cs.stackexchange.com, but not here. – kostix Sep 09 '22 at 14:41
  • There is nothing wrong with the quoted statements, they _are_ true, they just are not the _whole_ truth. Processing text on a "character" level (as most programmers are accustomed to) is complicated. If you have to deal with text, treat it as an opaque stream of bytes, make sure it's UTF-8. If you have to break that opaque stream up: Pay attention to the details, there be dragons! A lot of dragons. A fucking lot of dragons, some you never heard of or even dreamed of. – Volker Sep 09 '22 at 14:43
  • @acupoftea the problem with "these types are generally useful for a b c" is that there is inevitably a d, e, f, g, and sometimes h – and then someone else who thinks that they're not useful for b or e, etc. "Answerable" in the context of this site means (IMO) "succinctly and definitively answerable." I'm not familiar with the rules of [cs.se] or [softwareengineering.se] but they may be more suitable venues for these sorts of discussions. – miken32 Sep 09 '22 at 16:38

2 Answers


Text has many different meanings and usages, so the question is difficult to answer.

First, about the term codepoint: we use it because it is convenient, it implies a number (a code), and it is not easily confused with other terms. Unicode itself tells us that it does not use the terms codepoint and character consistently, but also that this is not a problem: the context is clear, and they are often interchangeable (except for the few codepoints which are not characters, like surrogates, and a few reserved codepoints). Note: Unicode is mostly about characters, and ISO 10646 was mostly about codepoints. So the original ISO standard was about a table of numbers (codepoints) and names, and Unicode about the properties of characters. We may therefore use codepoint where Unicode character would be better, but character is easily confused with C's char, and with font glyphs/graphemes.

Codepoints are a basic unit, and therefore useful for most programs: for storing in databases, for exchanging with other programs, for saving files, for sorting, etc. For exactly these reasons programming languages expose the codepoint as a type. UTF-8 code units could be an alternative, but they are more difficult to navigate: think of a UTF-8 string as a tape that must be read sequentially, and a sequence of codepoints as a disk where you can jump into the middle of the text. (This is not 100% accurate, because you may still need some context bytes.) If your program just accepts user text and stores it in a database, it probably does not need to split it into graphemes, handle ligatures, etc. The codepoint is really low level, and therefore fast for most operations.

The other part of text handling is display (or speech). This part is very complex, because there are many different scripts with very different rules, and then different languages with their own special cases. So we need a stack of libraries: text layout (word separation, etc., e.g. Pango), a shaping engine (to find which glyph to use, combine characters, decide where to put the next character, e.g. HarfBuzz), and a font library which renders the glyphs (e.g. Cairo plus FreeType). It is complex, but most programmers do not need special handling: they just read text from a database and send it to the screen, so they use the relevant library (which depends on the operating system) and move on. It is too complex for a language specification (and also a moving target; maybe in 30 years things will be more standardized). Because it is complex and involves many operations, we may use richer structures (an array of arrays of codepoints, i.e. an array of graphemes) without much of a slowdown. Note: fonts have codepoint tables used for various operations before the glyph index is found, and various APIs take Unicode strings (as codepoint arrays, UTF-16, UTF-8, etc.).

Naturally things get more complex, and require a lot of knowledge of the different parts of Unicode, if you are writing an editor (WYSIWYG, but also terminal-based): you mix both worlds, and you need much more information (e.g. for text selection). But in that case you must create your own structures.

And really, things are complex: do you want to show just the first x characters on your blog? Or split at word boundaries (some languages are not so linear, so the interpretation may be very wrong)? For now only humans can do a good job across all languages, so there is also no need yet for a supporting type in programming languages.

Giacomo Catenazzi

The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.

Where? It merely outlines advantages and disadvantages of code points. Two examples are:

Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character Ω, and must be treated identically.

Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.

In other words: code points just index which graphemes Unicode supports.
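The manifesto's omega/ohm example can be checked directly in Go: the two codepoints render identically but compare as unequal, so any byte- or rune-level comparison treats them as different text. A minimal sketch:

```go
package main

import "fmt"

func main() {
	omega := "\u03A9" // GREEK CAPITAL LETTER OMEGA
	ohm := "\u2126"   // OHM SIGN
	// Both render as Ω, yet as codepoint sequences they differ:
	fmt.Println(omega == ohm) // false
	fmt.Printf("U+%04X vs. U+%04X\n", []rune(omega)[0], []rune(ohm)[0])
}
```

Resolving this kind of canonical equivalence is what Unicode normalization is for; in Go that lives outside the standard library, in golang.org/x/text/unicode/norm (NFC maps U+2126 to U+03A9).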

  • Sometimes they're meant as single characters: one prominent example would be € (EURO SIGN), having only the code point U+20AC.

  • Sometimes the same character has multiple code points depending on context: the dollar sign exists as:

    • ﹩ = U+FE69 (SMALL DOLLAR SIGN)
    • ＄ = U+FF04 (FULLWIDTH DOLLAR SIGN)
    • 💲 = U+1F4B2 (HEAVY DOLLAR SIGN)

    Storage-wise, when searching for one variant you might want to match all 3 variants instead of relying on the exact code point only.

  • Sometimes multiple code points can be combined to form a single character:

    • á = U+00E1 (LATIN SMALL LETTER A WITH ACUTE), also termed "precomposed"
    • á = combination of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) - in a text editor, deleting (from the right side) will mostly delete the acute accent first. Searching for either variant should find both variants.

    Storage-wise you avoid needing to search for both variants by performing Unicode normalization, e.g. NFC, which always favors precomposed code points over combining sequences forming one character.

  • As for homoglyphs, code points clearly distinguish the contextual meaning:

    • A = U+0041 (LATIN CAPITAL LETTER A)
    • Α = U+0391 (GREEK CAPITAL LETTER ALPHA)
    • А = U+0410 (CYRILLIC CAPITAL LETTER A)

    Copy the Greek or Cyrillic character, then search this website for that letter - it will never find the other letters, no matter how similar they look. Likewise, the Latin letter A won't find the Greek or Cyrillic one.

  • Writing-system-wise, code points can be used by multiple alphabets: the CJK portion is an attempt to use as few code points as possible while supporting as many languages as possible - Chinese (simplified, traditional, Hong Kong), Japanese, Korean, Vietnamese:

    • 今 = U+4ECA
    • 入 = U+5165
    • 才 = U+624D

Dealing with code points as a programmer has valid reasons. Programming languages which support these types may (or may not) support the encodings correctly (UTF-8 vs. UTF-16 vs. ISO-8859-1) and may (or may not) correctly produce surrogates for UTF-16. Text-wise, users should not be concerned about code points, although knowing them would help with distinguishing homoglyphs.

AmigoJack