From the unicodedata documentation:

unicodedata.digit(chr[, default]) Returns the digit value assigned to the character chr as integer. If no such value is defined, default is returned, or, if not given, ValueError is raised.

unicodedata.numeric(chr[, default]) Returns the numeric value assigned to the character chr as float. If no such value is defined, default is returned, or, if not given, ValueError is raised.

Can anybody explain the difference between those two functions?

One can read the implementation of both functions here, but the difference is not evident to me from a quick look, because I'm not familiar with the CPython implementation.

EDIT 1:

An example that shows the difference would be nice.

EDIT 2:

Examples to complement the comments and the spectacular answer from @user2357112:

import unicodedata

print(unicodedata.digit('1'))  # DIGIT ONE: prints 1
print(unicodedata.digit('١'))  # ARABIC-INDIC DIGIT ONE: prints 1
try:
    print(unicodedata.digit('¼'))  # VULGAR FRACTION ONE QUARTER has no digit value
except ValueError as exc:
    print(exc)  # "not a digit"

print(unicodedata.numeric('Ⅱ'))  # ROMAN NUMERAL TWO: prints 2.0
print(unicodedata.numeric('¼'))  # VULGAR FRACTION ONE QUARTER: prints 0.25
  • I believe `numeric` works for other numeric characters besides the Arabic numerals, such as DEVANAGARI DIGIT ONE, and so on. – cs95 Aug 28 '17 at 16:38
  • @cᴏʟᴅsᴘᴇᴇᴅ Could you put an example of the difference pls? –  Aug 28 '17 at 16:39
  • Judging from types and description, digits are for actual digits, and numeric may process things like vulgar fractions (e.g. ¾). – weirdan Aug 28 '17 at 16:40
  • @weirdan From the doc it is clear that both functions accept just a single character as the first parameter. –  Aug 28 '17 at 16:42
  • Closely related (not sure if dupe or not): https://stackoverflow.com/questions/24384852/difference-between-unicode-isdigit-and-unicode-isnumeric – cs95 Aug 28 '17 at 16:44
  • @gsi-frank, what they accept is the same, but they differ in what they **return** – weirdan Aug 28 '17 at 16:53
  • @weirdan Just a simple example would clarify all this ;) Anybody? Nothing I have read here so far has cleared this up for me, and I think an example is the best way to illustrate it in this situation. –  Aug 28 '17 at 16:55
  • @weirdan's example of ¾ seems to fit - it's a single Unicode character (codepoint U+00BE) with a numeric value of 3/4 but no digit value. – Peter DeGlopper Aug 28 '17 at 17:01
  • @PeterDeGlopper You are right. `unicodedata.numeric('¼')` and `unicodedata.digit('¼')` are the examples that illustrate that clearly. Thanks to everybody who bore with this question. –  Aug 28 '17 at 17:06
  • It would be nice if someone posted such an example as an answer, so that other people interested in this question don't have to read the comment thread. –  Aug 28 '17 at 17:09

1 Answer

Short answer:

If a character represents a decimal digit, such as 1, ¹ (SUPERSCRIPT ONE), ① (CIRCLED DIGIT ONE), or ١ (ARABIC-INDIC DIGIT ONE), unicodedata.digit will return the digit that character represents as an int (so 1 for all of these examples).

If the character represents any numeric value, such as ⅐ (VULGAR FRACTION ONE SEVENTH) or any of the decimal digit examples above, unicodedata.numeric will give that character's numeric value as a float.

For technical reasons, more recent digit characters like 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO) may raise a ValueError from unicodedata.digit.
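A quick way to check the short answer, sketched with only the standard-library unicodedata module. Passing None as the default avoids the ValueError when a value is missing, and the `samples` and `results` names are just for illustration. The result for the last character assumes a Python build whose Unicode database is version 7.0 or later; older builds may not know the character at all.

```python
import unicodedata

# Characters from the short answer, written as escapes so they survive copy/paste.
samples = {
    "DIGIT ONE": "1",
    "SUPERSCRIPT ONE": "\u00b9",
    "CIRCLED DIGIT ONE": "\u2460",
    "ARABIC-INDIC DIGIT ONE": "\u0661",
    "VULGAR FRACTION ONE SEVENTH": "\u2150",
    "DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO": "\U0001F10C",
}

# Map each name to a (digit value or None, numeric value or None) pair.
results = {}
for name, ch in samples.items():
    results[name] = (unicodedata.digit(ch, None), unicodedata.numeric(ch, None))
    print(name, results[name])
```

The first four characters should report a digit value and a numeric value; the last two should report None for the digit value and only a numeric value.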


Long answer:

Unicode characters all have a Numeric_Type property. This property can have 4 possible values: Numeric_Type=Decimal, Numeric_Type=Digit, Numeric_Type=Numeric, or Numeric_Type=None.

Quoting the Unicode standard, version 10.0.0, section 4.6,

The Numeric_Type=Decimal property value (which is correlated with the General_Category=Nd property value) is limited to those numeric characters that are used in decimal-radix numbers and for which a full set of digits has been encoded in a contiguous range, with ascending order of Numeric_Value, and with the digit zero as the first code point in the range.

Numeric_Type=Decimal characters are thus decimal digits fitting a few other specific technical requirements.

Decimal digits, as defined in the Unicode Standard by these property assignments, exclude some characters, such as the CJK ideographic digits (see the first ten entries in Table 4-5), which are not encoded in a contiguous sequence. Decimal digits also exclude the compatibility subscript and superscript digits, to prevent simplistic parsers from misinterpreting their values in context. (For more information on superscript and subscripts, see Section 22.4, Superscript and Subscript Symbols.) Traditionally, the Unicode Character Database has given these sets of noncontiguous or compatibility digits the value Numeric_Type=Digit, to recognize the fact that they consist of digit values but do not necessarily meet all the criteria for Numeric_Type=Decimal. However, the distinction between Numeric_Type=Digit and the more generic Numeric_Type=Numeric has proven not to be useful in implementations. As a result, future sets of digits which may be added to the standard and which do not meet the criteria for Numeric_Type=Decimal will simply be assigned the value Numeric_Type=Numeric.

So Numeric_Type=Digit was historically used for other digits not fitting the technical requirements of Numeric_Type=Decimal, but they decided that wasn't useful, and digit characters not meeting the Numeric_Type=Decimal requirements have just been assigned Numeric_Type=Numeric since Unicode 6.3.0. For example, 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO), introduced in Unicode 7.0, has Numeric_Type=Numeric.

Numeric_Type=Numeric is for all characters that represent numbers and don't fit in the other categories, and Numeric_Type=None is for characters that don't represent numbers (or at least, don't under normal usage).

All characters with a non-None Numeric_Type property have a Numeric_Value property representing their numeric value. unicodedata.digit will return that value as an int for characters with Numeric_Type=Decimal or Numeric_Type=Digit, and unicodedata.numeric will return that value as a float for characters with any non-None Numeric_Type.
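As far as I know, unicodedata does not expose Numeric_Type directly, but it can be approximated from which lookups succeed. The sketch below assumes that unicodedata.decimal only has a value for Numeric_Type=Decimal characters; `approx_numeric_type` is a hypothetical helper name, not part of the module.

```python
import unicodedata

def approx_numeric_type(ch):
    """Guess the Numeric_Type of a single character from unicodedata lookups.

    Hypothetical helper: an approximation, not an official property query.
    """
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"   # decimal(), digit() and numeric() all have a value
    if unicodedata.digit(ch, None) is not None:
        return "Digit"     # digit() and numeric() have a value, decimal() does not
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"   # only numeric() has a value
    return "None"          # no numeric value at all

# DIGIT ONE, SUPERSCRIPT ONE, VULGAR FRACTION ONE QUARTER, LATIN SMALL LETTER A
for ch in ["1", "\u00b9", "\u00bc", "a"]:
    print(repr(ch), approx_numeric_type(ch))
```

The checks run from the narrowest category to the widest, since a Decimal character also has a digit value and a numeric value.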

user2357112