9

According to the Official Unicode Consortium code chart, all of these are numeric:

⅐   ⅑   ⅒   ⅓   ⅔   ⅕   ⅖   ⅗   ⅘   ⅙   ⅚   ⅛   ⅜   ⅝   ⅞   ⅟
Ⅰ   Ⅱ   Ⅲ   Ⅳ   Ⅴ   Ⅵ   Ⅶ   Ⅷ   Ⅸ   Ⅹ   Ⅺ   Ⅻ   Ⅼ   Ⅽ   Ⅾ   Ⅿ
ⅰ   ⅱ   ⅲ   ⅳ   ⅴ   ⅵ   ⅶ   ⅷ   ⅸ   ⅹ   ⅺ   ⅻ   ⅼ   ⅽ   ⅾ   ⅿ
ↀ   ↁ   ↂ   Ↄ   ↄ   ↅ   ↆ   ↇ   ↈ   ↉   ↊   ↋

However, when I ask Python which ones are numeric, it says they all are (even ⅟) except for four:

In [252]: print([k for k in "⅐⅑⅒⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞⅟ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿↀↁↂↃↄↅↆↇↈ↉↊↋" if not k.isnumeric()])
['Ↄ', 'ↄ', '↊', '↋']

Those are:

  • Ↄ Roman Numeral Reversed One Hundred
  • ↄ Latin Small Letter Reversed C
  • ↊ Turned Digit Two
  • ↋ Turned Digit Three

Why does Python consider those to be not numeric?

gerrit
  • 7
    Because they're not numbers. – apokryfos Oct 20 '16 at 13:58
  • 1
    @apokryfos Unicode says they are? I would argue that `⅟` is not a number, but Python and Unicode say it is. – gerrit Oct 20 '16 at 14:00
  • I confirm this behavior with both Python 2.7 and 3.5. I have a hypothesis, which I am investigating. – zwol Oct 20 '16 at 14:00
  • Unicode says they **will be** when version 9 is universally adopted. – apokryfos Oct 20 '16 at 14:01
  • @apokryfos Aha, so it's related to Python relating to an older Unicode version? – gerrit Oct 20 '16 at 14:02
  • 4
    @apokryfos Regardless, your first comment is a bit nonsense. Your second comment is correct. – Konrad Rudolph Oct 20 '16 at 14:02
  • @KonradRudolph it depends. If the question is about completeness of python when it comes to reading text and finding which parts of it are numbers then yes, those should be numbers. However when it's about doing mathematical operations those symbols should not be valid. – apokryfos Oct 20 '16 at 14:06
  • @apokryfos Wow, this will break a lot of code assuming isnumeric means [0-9] – Filip Haglund Oct 20 '16 at 14:14
  • 2
    @Filip Well, `isnumeric` never\* meant "0-9", so that code was broken all along. (\* In any recent version of Python that I'm aware of.) – deceze Oct 20 '16 at 14:17
  • 1
    @apokryfos UnicodeData.txt does not agree with your claim that these characters are numeric in 9.0.0 - see my answer. Do you have a reference for your claim? – zwol Oct 20 '16 at 14:27
  • @zwol http://www.unicode.org/charts/PDF/U2150.pdf does say *218A ↊ TURNED DIGIT TWO digit for 10 in some duodecimal systems*. I don't know why `UnicodeData.txt` does not agree. – apokryfos Oct 20 '16 at 14:52
  • See also: *http://stackoverflow.com/q/40148683/974555* – gerrit Oct 20 '16 at 15:04

1 Answer

10

str.isnumeric is documented to be true for "all characters that have the Unicode numeric value property".

The canonical reference for that property is the Unicode Character Database. The information we need can be dug out of http://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt (warning: 1.5MB text file), which is the latest version at the time of writing (late 2016). It's a little tricky to read; the format is documented in UAX#44. I'll first show its entry for a character that is numeric, U+3023 HANGZHOU NUMERAL THREE (〣):

3023;HANGZHOU NUMERAL THREE;Nl;0;L;;;;3;N;;;;;

The ninth semicolon-separated field (field 8 in the zero-based numbering that UAX#44 uses) is the "numeric value" property; in this case, its value is 3, consistent with the name of the character. Python's str.isnumeric is true if and only if this field is nonempty. It can be interrogated directly using unicodedata.numeric.
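As a minimal sketch of that lookup (the helper name numeric_value is just for illustration, not part of any library): unicodedata.numeric raises ValueError for a character without a numeric value unless a default is passed, so passing None makes the "field is empty" case easy to see.

```python
import unicodedata

def numeric_value(ch, default=None):
    """Return the Unicode numeric value of ch, or default if it has none."""
    # unicodedata.numeric(ch) alone would raise ValueError for non-numeric
    # characters; supplying a default avoids the exception.
    return unicodedata.numeric(ch, default)

# U+3023 HANGZHOU NUMERAL THREE, U+215F FRACTION NUMERATOR ONE,
# U+216B ROMAN NUMERAL TWELVE, and a plain letter for contrast.
for ch in "\u3023\u215F\u216Bx":
    print(f"U+{ord(ch):04X}: numeric={numeric_value(ch)}, isnumeric={ch.isnumeric()}")
```

The exact results depend on the Unicode version bundled with your interpreter, but for long-established characters like these they should agree with UnicodeData.txt.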

The third semicolon-separated field is a two-character code giving the "general category"; in this case, "Nl". Most, but not all, of the characters with a numeric value are in one of the "number" categories (first letter of the category code is an N). The exceptions are all hanzi that, depending on context, may or may not signify a number; see UAX#38.
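To see the general category next to the numeric value, something like the following should work. The sample characters are my own picks for illustration: an ASCII digit, a vulgar fraction, the Hangzhou numeral from above, and the hanzi 三 (U+4E09) as an example of a character that has a numeric value without being in a "number" category.

```python
import unicodedata

# '5', U+00BD VULGAR FRACTION ONE HALF, U+3023 HANGZHOU NUMERAL THREE,
# U+4E09 (the hanzi for "three").
samples = ["5", "\u00BD", "\u3023", "\u4E09"]

for ch in samples:
    category = unicodedata.category(ch)      # two-letter general category
    value = unicodedata.numeric(ch, None)    # None if no numeric value
    print(f"U+{ord(ch):04X} category={category} numeric={value}")
```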

Now, the characters you are asking about:

2183;ROMAN NUMERAL REVERSED ONE HUNDRED;Lu;0;L ;;;;;N;;;    ;2184;
2184;LATIN SMALL LETTER REVERSED C     ;Ll;0;L ;;;;;N;;;2183;    ;2183
218A;TURNED DIGIT TWO                  ;So;0;ON;;;;;N;;;    ;    ;
218B;TURNED DIGIT THREE                ;So;0;ON;;;;;N;;;    ;    ;

These characters do not have a numeric value assigned, so Python's behavior is correct-as-documented.
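You can check this from Python itself rather than reading UnicodeData.txt. This is only a sketch, and it reflects whatever Unicode version your interpreter ships with, so a future Unicode release could in principle change the result.

```python
import unicodedata

# The four characters from the question: U+2183, U+2184, U+218A, U+218B.
chars = "\u2183\u2184\u218A\u218B"

for ch in chars:
    # name() gets a fallback in case an older database lacks the character.
    name = unicodedata.name(ch, "<unnamed>")
    category = unicodedata.category(ch)
    value = unicodedata.numeric(ch, None)  # None means "no numeric value"
    print(f"U+{ord(ch):04X} {name}: category={category}, "
          f"numeric={value}, isnumeric={ch.isnumeric()}")
```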

Note: per https://docs.python.org/3.6/whatsnew/3.6.html, Python will only be updated to Unicode 9.0.0 in the 3.6 release; however, AFAICT these characters have not changed in quite some time.
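If you want to know which version of the Unicode Character Database your own interpreter was built with, the unicodedata module exposes it as unidata_version. A quick check might look like this:

```python
import sys
import unicodedata

# Version of the running interpreter, e.g. "3.6.0".
python_version = sys.version.split()[0]

# Version of the Unicode Character Database compiled into unicodedata.
unicode_version = unicodedata.unidata_version

print(f"Python {python_version} uses Unicode {unicode_version}")
```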

("Why don't these characters have a numeric value?" is a question that only the Unicode Consortium can answer definitively; if you are interested I suggest bringing it up on one of their mailing lists.)

zwol
  • 1
    Just to confirm that the same behavior can be seen with a current (3.7.0a0 at ab9835) build of CPython (that includes the [Unicode 9 update](https://github.com/python/cpython/commit/a475929bb31052e223bd552996d8ccf68b201a9b)). – wrwrwr Oct 20 '16 at 14:35
  • Interesting. Then I wonder where [Wikipedia](https://en.wikipedia.org/wiki/Number_Forms) gets its information that `↊`=10 and `↋`=11. – gerrit Oct 20 '16 at 14:47
  • @gerrit Per https://en.wikipedia.org/wiki/Duodecimal this appears to be a convention that never made it into Unicode. – zwol Oct 20 '16 at 14:53
  • @gerrit it's more to do about culture and less to do about actual numbers which is what Unicode is supposed to be helping with. Unicode is not a mathematical standard but rather a linguistic one. – apokryfos Oct 20 '16 at 14:54