
Why do Unicode code points appear as U+<codepoint>?

For example, U+2202 represents the character ∂ (PARTIAL DIFFERENTIAL).

Why not U- (dash or hyphen character) or anything else?

DavidRR
Senthil Kumaran

4 Answers

The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler’s explanation on the Unicode mailing list.

Jukka K. Korpela

The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on Unicode Consortium web site).

The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".

My personal recollection from early-1990s software industry discussions about Unicode is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points U+010000 and above, the convention of "U-" followed by six hexadecimal digits came into use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.

I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990's should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (p. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines symbols U+HHHH and U-HHHHHHHH (p. 559).

The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the Python language defines the following string literals:

  • u'xyz' to indicate a Unicode string, a sequence of Unicode characters
  • '\uxxxx' to indicate a string containing a Unicode character denoted by four hex digits
  • '\Uxxxxxxxx' to indicate a string containing a Unicode character denoted by eight hex digits
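
A minimal sketch of these literals (my own illustration, assuming Python 3, where the u'' prefix inherited from Python 2 is still accepted since 3.3):

```python
# The three literal notations listed above.
s1 = u'xyz'           # a Unicode string
s2 = '\u2202'         # four hex digits: U+2202 PARTIAL DIFFERENTIAL
s3 = '\U0001F600'     # eight hex digits: U+1F600 GRINNING FACE

print(s1, s2, s3)     # xyz ∂ 😀
print(hex(ord(s2)))   # 0x2202 -- back to the code point number
```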
Jim DeLaHunt

It depends on what version of the Unicode standard you are talking about. From Wikipedia:

Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" to be used only with exactly four digits to indicate a code unit, not a code point.

Sean Bright
  • That was the helpful reference. But the reason for that change is not mentioned. Was it just a whim of the committee? – Senthil Kumaran Aug 13 '09 at 18:23
  • 2
    I don't see the "U-" convention in either [The Unicode Standard 3.0.0](http://www.unicode.org/versions/Unicode3.0.0/) or [The Unicode Standard 2.0.0](http://www.unicode.org/versions/Unicode2.0.0/) as archived on the Unicode Consortium's web site. I think Wikipedia is wrong here. – Jim DeLaHunt Jan 17 '12 at 07:08
  • 1
    It's in the preface (http://www.unicode.org/versions/Unicode3.0.0/Preface.pdf), but only mentioned briefly. – Sean Bright Jan 17 '12 at 11:33

It is just a convention to show that the value is Unicode. A bit like '0x' or 'h' for hex values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because that's how the coin flipped :-)
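
To make the analogy concrete, here is a small Python sketch (my own example, not from the answer): the prefix is pure notation, and the value it names is the same either way.

```python
# 0x is Python's hex prefix; the prefix only marks the notation,
# the underlying number is identical.
value = 0xB9
print(value == 185)        # True

# Likewise, U+00B9 is just a way of writing the code point 0xB9,
# which names the character SUPERSCRIPT ONE.
print(chr(0xB9))           # ¹
```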

mirabilos
Mihai Nita