16

Assuming the following:

>>> square = '²'      # Superscript Two (Unicode U+00B2)
>>> cube  = '³'       # Superscript Three (Unicode U+00B3)

Curiously:

>>> square.isdigit()
True
>>> cube.isdigit()
True

OK, let's convert those "digits" to integers:

>>> int(square)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'
>>> int(cube)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '³'

Oooops!

Could someone please explain what behavior I should expect from the str.isdigit() method when handling strings?

Mark Tolonen
Lacobus
  • I would vote-to-close this as a duplicate of [How can I check if a string represents an int, without using try/except?](https://stackoverflow.com/questions/1265665/how-can-i-check-if-a-string-represents-an-int-without-using-try-except) if that didn't ask specifically about avoiding try/except (the top-voted answer is just `isdigit`, the second-from-top one is the one you want). Also related: [What's the difference between str.isdigit, isnumeric and isdecimal in Python?](https://stackoverflow.com/questions/44891070/whats-the-difference-between-str-isdigit-isnumeric-and-isdecimal-in-python) – Bernhard Barker Sep 22 '21 at 11:27
  • Are you trying to check whether a string can be converted to an integer, or are you just trying to understand what `isdigit` does? If you're trying to answer the latter, the documentation would be the first place to turn. – Bernhard Barker Sep 22 '21 at 12:05

1 Answer

20

str.isdigit doesn't claim to be related to parsability as an int. It reports a simple Unicode property: whether every character is a decimal character or a digit of some sort. From the documentation:

str.isdigit()

Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
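To make the quoted definition concrete, here is a small sketch using the standard `unicodedata` module, which exposes the decimal, digit, and numeric values attached to a character. The helper name `describe` and the sample characters are chosen purely for illustration; they are not part of the documentation.

```python
import unicodedata


def describe(ch):
    """Return the (decimal, digit, numeric) values of ch, None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


for ch in ['5', '²', '½', 'x']:
    # isdigit() is True exactly when the character has a digit value
    print(repr(ch), describe(ch), ch.isdigit())
```

In this sketch `'²'` has a digit value but no decimal value, which is why `isdigit()` accepts it even though it cannot appear in a base-10 literal.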

In short, str.isdigit is thoroughly useless for detecting valid numbers. The correct solution to checking if a given string is a legal integer is to call int on it, and catch the ValueError if it's not a legal integer. Anything else you do will be (badly) reinventing the same tests the actual parsing code in int() performs, so why not let it do the work in the first place?
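A minimal sketch of that approach follows. The function name `parse_int` and the choice to return `None` on failure are assumptions made for illustration; raising or returning a default would work just as well.

```python
def parse_int(text):
    """Return int(text), or None if text is not a valid base-10 integer."""
    try:
        return int(text)
    except ValueError:
        # int() rejected the string, so it is not a legal integer literal
        return None


print(parse_int('-2'))    # a sign is fine for int(), but fails isdigit()
print(parse_int(' 42 '))  # surrounding whitespace is accepted by int()
print(parse_int('²'))     # passes isdigit(), but int() rejects it
```

Note that this sketch only handles `str` input; passing something like `None` to `int()` raises `TypeError`, which is deliberately not caught here.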

Side-note: You're using the term "utf-8" incorrectly. UTF-8 is a specific way of encoding Unicode, and it only applies to raw binary data. Python's str is an "idealized" Unicode text type; it has no encoding visible at the Python layer. Under the hood, it's stored encoded as one of ASCII, latin-1, UCS-2, UCS-4, and possibly also UTF-8, but the only way to observe that is through indirect measurements like sys.getsizeof, which merely hints at the underlying encoding by showing how much memory the string consumes. The characters you're talking about are simply Unicode characters above the ASCII range; they're not specifically UTF-8.
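The distinction can be sketched as follows: the `str` holds one code point, and an encoding only comes into play when it is converted to `bytes`. The variable names are illustrative.

```python
square = '²'  # one Unicode code point, U+00B2

print(len(square))       # length in code points, independent of any encoding
print(hex(ord(square)))  # the code point itself

# The same character yields different bytes under different encodings
utf8 = square.encode('utf-8')
latin1 = square.encode('latin-1')
utf16 = square.encode('utf-16-le')
print(utf8, latin1, utf16)

# Decoding with the matching encoding recovers the same str each time
assert utf8.decode('utf-8') == latin1.decode('latin-1') == square
```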

ShadowRanger
  • 2
    I had no idea about `isdigit` doing that. Does this hold true for `isnumeric`? I'm going to have to grep my github org for that and fix it tomorrow. – flakes Sep 22 '21 at 00:39
  • 2
    @flakes: `isnumeric` is also useless for detecting whether a string represents a number. It tests characters for Unicode properties Numeric_Type=Digit, Numeric_Type=Decimal or Numeric_Type=Numeric. – user2357112 Sep 22 '21 at 00:41
  • @user2357112supportsMonica Wild. It's refactoring day for me tomorrow. – flakes Sep 22 '21 at 00:43
  • 4
    @flakes: `isnumeric` is worse; it returns `True` for all things `isdigit` covers, plus a third category, `Numeric_Type=Numeric`. `isdecimal` seems to be the most strict test (and in fact, `'²'.isdecimal()` returns `False`, unlike `isdigit` and `isnumeric`), so it gets you closer to what constitutes a valid `int`, but again, the correct solution to "Is this a legal `int`?" is "Call `int()` on it and catch the `ValueError` if it fails"; prechecks for string properties will always be either too strict (`-2` won't pass these tests, but `int()` can parse it) or too lax (allows `²` and the like). – ShadowRanger Sep 22 '21 at 00:43
  • 1
    Don't forget that `isalnum` will also be `True`. – sj95126 Sep 22 '21 at 00:45
  • 2
    Note that Numeric_Type=Digit is no longer used for new characters, so the test `isdigit` performs is now even less useful than it used to be - new characters that would previously have received Numeric_Type=Digit now receive Numeric_Type=Numeric. See [Unicode Standard 14.0 chapter 4.6](https://www.unicode.org/versions/Unicode14.0.0/ch04.pdf). – user2357112 Sep 22 '21 at 00:45
  • 1
    @sj95126: Yeah, oddly, `isalnum` is defined in terms of the union of `isalpha`, `isdecimal`, `isdigit` and `isnumeric` (it's true if all the characters pass at least one of them). The `isdigit` part seems pointless since `isnumeric` is a strict superset of `isdigit`, and I *think* (not 100%) that it's also a superset of `isdecimal`. But yeah, it's using the most broad definition of "numeric". – ShadowRanger Sep 22 '21 at 00:47
  • 1
    [set(decimal) < set(digit) < set(numeric)](https://tio.run/##fY5BCsIwEEX3niK7pFBKixsXepiQpu1IMg2TCIp49pjYoKC2f/d5/DfjbmGacX9wFGOvFVhp2Ilx3pxnQDGACZqED9SAL7hmVjqhJqoZSRy1aK9d16ZUKbseRgirhgy393ixmkCtGQredjgCDGLgd6NRlK@rZHzwb5b/@UvKnR/mdfgYj@xVF8lS3rs0zMsYnw) – no comment Sep 22 '21 at 01:09