
Why do Unicode code points appear as U+<codepoint>?

For example, U+2202 represents the character ∂ (PARTIAL DIFFERENTIAL).

Why not U- (dash or hyphen character) or anything else?

DavidRR
Senthil Kumaran

4 Answers

The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler’s explanation on the Unicode mailing list.

Jukka K. Korpela

The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on Unicode Consortium web site).

The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".

My personal recollection from early-1990s software industry discussions about Unicode is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points U+010000 and above, the convention of "U-" followed by six hexadecimal digits came into use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.

I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990's should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (p. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines symbols U+HHHH and U-HHHHHHHH (p. 559).

The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the Python language defines the following string literals:

  • u'xyz' to indicate a Unicode string, a sequence of Unicode characters
  • '\uxxxx' to indicate a string containing a Unicode character denoted by four hex digits
  • '\Uxxxxxxxx' to indicate a string containing a Unicode character denoted by eight hex digits
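
A minimal sketch of these literals (my own illustration, assuming Python 3, where the u'' prefix inherited from Python 2 is still accepted since 3.3):

```python
# The three literal notations listed above.
s1 = u'xyz'           # a Unicode string
s2 = '\u2202'         # four hex digits: U+2202 PARTIAL DIFFERENTIAL
s3 = '\U0001F600'     # eight hex digits: U+1F600 GRINNING FACE

print(s1, s2, s3)     # xyz ∂ 😀
print(hex(ord(s2)))   # 0x2202 -- back to the code point number
```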
Jim DeLaHunt

It depends on what version of the Unicode standard you are talking about. From Wikipedia:

Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" to be used only with exactly four digits to indicate a code unit, not a code point.

Sean Bright
  • That was the helpful reference. But the reason for that change is not mentioned. Was it just a whim of the committee? – Senthil Kumaran Aug 13 '09 at 18:23
  • 2
    I don't see the "U-" convention in either [The Unicode Standard 3.0.0](http://www.unicode.org/versions/Unicode3.0.0/) or [The Unicode Standard 2.0.0](http://www.unicode.org/versions/Unicode2.0.0/) as archived on the Unicode Consortium's web site. I think Wikipedia is wrong here. – Jim DeLaHunt Jan 17 '12 at 07:08
  • 1
    It's in the preface (http://www.unicode.org/versions/Unicode3.0.0/Preface.pdf), but only mentioned briefly. – Sean Bright Jan 17 '12 at 11:33

It is just a convention to show that the value is Unicode. A bit like '0x' or 'h' for hex values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because that's how the coin flipped :-)
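
To make the analogy concrete, here is a small Python sketch (my own example, not from the answer): the prefix is pure notation, and the value it names is the same either way.

```python
# 0x is Python's hex prefix; the prefix only marks the notation,
# the underlying number is identical.
value = 0xB9
print(value == 185)        # True

# Likewise, U+00B9 is just a way of writing the code point 0xB9,
# which names the character SUPERSCRIPT ONE.
print(chr(0xB9))           # ¹
```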

mirabilos
Mihai Nita