Efficiently check if string contains a digit in python

Question

I have a huge amount (GB) of text to process, sentence by sentence. In each sentence I have a costly operation to perform on numbers, so I check that this sentence contains at least one digit. I have done this check using different means and measured those solutions using timeit.

s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' # example

any(c.isdigit() for c in s)  # 3.61 µs
re.search(r'\d', s)  # 402 ns
d = re.compile(r'\d'); d.search(s)  # 126 ns
'0' in s or '1' in s or '2' in s or '3' in s or '4' in s or '5' in s or '6' in s or '7' in s or '8' in s or '9' in s  # 60 ns

The last way is the fastest one, but it is ugly and probably 10x slower than possible.

Of course I could rewrite this in cython, but it seems overkill.

Is there a better pure Python solution? In particular, I wonder why str.startswith() and str.endswith() accept a tuple argument, while the same does not seem to be possible with the in operator.
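To illustrate the asymmetry raised here, a minimal sketch follows. The names `DIGITS` and `contains_digit` are hypothetical, introduced only for this example. `str.startswith` and `str.endswith` accept a tuple of candidates, whereas the `in` operator on a string requires a string as its left operand and raises `TypeError` for a tuple.

```python
DIGITS = tuple('0123456789')  # hypothetical constant, not from the question

s = 'abc123'

# startswith/endswith accept a tuple of candidate prefixes/suffixes.
starts_with_digit = '7up'.startswith(DIGITS)
ends_with_digit = s.endswith(DIGITS)

# The `in` operator does not: its left operand must be a str.
try:
    DIGITS in s
    tuple_in_str_supported = True
except TypeError:
    tuple_in_str_supported = False


def contains_digit(text):
    """Hypothetical helper: run the membership test once per digit."""
    return any(d in text for d in DIGITS)
```

The explicit `any(...)` loop is the closest pure Python equivalent, although it adds generator overhead compared with the hand-written `or` chain.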

Try `any(d in s for d in '0123456789')` it should be equivalent to the last option but reads better — rdas, May 24 '22 at 15:54
Have you tried regex with reusing a compiled regular expression (`pattern = re.compile(r'\d')`)? — Oli, May 24 '22 at 15:54
Does this answer your question? [Check if a string contains a number](https://stackoverflow.com/questions/19859282/check-if-a-string-contains-a-number) — 0x263A, May 24 '22 at 15:54
@Oli better than non-compiled, but still not as fast as last way. Reviewed this solution in my post — M. Page, May 24 '22 at 16:05
There's yet another way to go about it, I don't think it's the most efficient though: `tbl = ''.maketrans('', '', '0123456789'); txt != txt.translate(tbl)` — Peter, May 24 '22 at 16:19
@0x263A regex sub for processing numbers written in various local number formats (decimal point, thousands separator, ...), which may or may not be attached to units — M. Page, May 24 '22 at 17:39
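The alternatives suggested in the comments above can be sketched side by side as follows. The function names are hypothetical and chosen only for this illustration. No timings are claimed here; measure on your own data.

```python
import re

DIGIT_RE = re.compile(r'\d')  # compiled once and reused
# Translation table that deletes the ten ASCII digits.
STRIP_DIGITS = str.maketrans('', '', '0123456789')


def has_digit_any(text):
    # Generator form of the chained `in` tests.
    return any(d in text for d in '0123456789')


def has_digit_regex(text):
    # Note: for str patterns, \d also matches non-ASCII decimal digits.
    return DIGIT_RE.search(text) is not None


def has_digit_translate(text):
    # If removing digits changes the string, it contained at least one.
    return text != text.translate(STRIP_DIGITS)
```

For plain ASCII input the three helpers should agree; they can differ on text containing non-ASCII digits, because only the `\d` pattern matches those.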

score 1 · Accepted Answer · answered May 25 '22 at 09:25

Actual performance may vary with your platform and Python version, but on my setup (Python 3.9.5 / Ubuntu), re.match turns out to be significantly faster than re.search, and it outperforms the long chain of in tests. Also, compiling the regex with [0-9] instead of \d gives a small improvement.

import re
from timeit import timeit

n = 10_000_000
s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'

# reference
timeit(lambda: '0' in s or '1' in s or '2' in s or '3' in s or '4' in s or '5' in s or '6' in s or '7' in s or '8' in s or '9' in s, number=n)
# 2.1005349759998353

# re.search with \d, slower
d = re.compile(r'\d')
timeit(lambda: d.search(s), number=n)
# 2.9816031390000717

# re.search with [0-9], better but still slower than reference
d = re.compile('[0-9]')
timeit(lambda: d.search(s), number=n)
# 2.640713582999524

# re.match with [0-9], faster than reference
d = re.compile('[0-9]')
timeit(lambda: d.match(s), number=n)
# 1.5671786130005785

So, on my machine, re.match with a compiled [0-9] pattern is about 25% faster than the long chain of in tests joined with or. And it looks better too.
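One thing worth checking before relying on this timing: `re.match` only attempts a match at the start of the string, while `re.search` scans the whole string, so the two calls answer different questions. The sketch below uses made-up sample strings to show the difference. It may also explain why `match` is quick on the benchmark string, which contains no digit, since the attempt fails at the first character.

```python
import re

digit = re.compile(r'[0-9]')

# Sample strings invented for this illustration.
samples = {
    'no_digit': 'abcdefghijklmnopqrstuvwxyz',
    'digit_inside': 'abc123def',
    'digit_first': '123abc',
}

# match() is anchored at position 0; search() looks anywhere.
match_results = {k: digit.match(v) is not None for k, v in samples.items()}
search_results = {k: digit.search(v) is not None for k, v in samples.items()}
```

Here `match_results['digit_inside']` is `False` while `search_results['digit_inside']` is `True`, so for a "contains a digit" check `search` (or the chained `in` tests) is the variant that gives the intended answer.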

Interesting. On my machine, re.compile() + match() is also faster than ref solution. Good job ! — M. Page, May 25 '22 at 12:37
