How to write a Python regex that matches strings with both words and digits, excluding digits-only strings?

Question

I want to write a regex that matches a string that may contain both words and digits and not digits only.

I used this regex [A-z+\d*], but it does not work.

Some matched samples:

expression123
123expression
exp123ression

Not matched sample:

1235234567544

Can you help me with this one? Thank you in advance

"...that may contain both words and digits..." is unclear because of "may" (e.g., could the string contain a dollar sign?), the use of "words" as opposed to "word characters" and the fact that a digit is a word character. Also, is an empty string to be matched? Please clarify by editing your question. — Cary Swoveland, May 20 '23 at 21:42
Remove quantifiers from inside the class, convert characters to lowercase and use the case insensitivity flag if necessary, quantify the class as a whole, anchor the expression to the line in its entirety and use a lookahead to exclude lines that only contain digits and you'll get something like `^(?!\d+$)[a-z\d]+$`. — oriberu, May 21 '23 at 08:08
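
A minimal sketch of the pattern suggested in the comment above, applied to the samples from the question. Passing `re.IGNORECASE` so that uppercase letters are accepted is an assumption about what the asker wants, and the variable names are only illustrative.

```python
import re

# Pattern from the comment: letters and digits only, but not digits only.
# re.IGNORECASE (an assumption here) lets [a-z] accept uppercase letters too.
pattern = re.compile(r"^(?!\d+$)[a-z\d]+$", re.IGNORECASE)

samples = ["expression123", "123expression", "exp123ression", "1235234567544"]

# Keep only the strings the pattern matches.
matched = [s for s in samples if pattern.search(s)]
print(matched)
```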

Jan · Accepted Answer · 2023-05-21T08:06:35.750

Lookarounds to the rescue!

^(?!\d+$)\w+$

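This uses a negative lookahead and anchors: `(?!\d+$)` rejects strings that consist of digits only, and `^\w+$` requires the whole string to be made of word characters. See a demo on regex101.com.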
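
One possible way to use this expression from Python is sketched below. The helper name `is_wordlike` is hypothetical, and the extra sample `snake_case` is added only to show that `\w` also accepts underscores.

```python
import re

# The expression from this answer, compiled once for reuse.
PATTERN = re.compile(r"^(?!\d+$)\w+$")


def is_wordlike(text):
    """Return True if text is all word characters and not digits only."""
    return PATTERN.search(text) is not None


for sample in ["expression123", "123expression", "exp123ression",
               "1235234567544", "snake_case"]:
    print(sample, is_wordlike(sample))
```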

Note that you could have the same result with pure Python code alone:

samples = ["expression123", "123expression", "exp123ression", "1235234567544"]
 
filtered = [item for item in samples if not item.isdigit()]
print(filtered)

# ['expression123', '123expression', 'exp123ression']

See another demo on ideone.com.

Note that the two approaches differ for input strings like -1 or 1.0: the isdigit() filter keeps them, whereas the regex does not match them because - and . are not word characters.
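
A small sketch to check how each approach treats such inputs. The `results` dictionary and the extra sample `abc1` are illustrative additions.

```python
import re

pattern = re.compile(r"^(?!\d+$)\w+$")

results = {}
for text in ["-1", "1.0", "abc1"]:
    kept_by_filter = not text.isdigit()                # the list-comprehension approach
    matched_by_regex = pattern.search(text) is not None  # the regex approach
    results[text] = (kept_by_filter, matched_by_regex)
    print(f"{text!r}: kept by isdigit filter={kept_by_filter}, "
          f"matched by regex={matched_by_regex}")
```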

Tests

Since the question of performance came up in the comments, here is a small test suite for different sample sizes and expressions:

import string, random, re, timeit


class RegexTester():
    samples = []
    expressions_to_test = {"Cary": r"^(?=.*\D)\w+$",
                           "Jan": r"^(?!\d+$)\w+$"}

    def __init__(self, sample_size=100, word_size=10, times=100):
        self.sample_size = sample_size
        self.word_size = word_size
        self.times = times

        # generate samples
        self.samples = ["".join(random.choices(string.ascii_letters + string.digits, k=self.word_size))
                        for _ in range(self.sample_size)]

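        # compile the expressions in question; build a per-instance dict so the
        # shared class attribute is not mutated (a second instance would otherwise
        # try to compile an already-converted entry)
        self.expressions_to_test = {
            key: {"raw": expression, "compiled": re.compile(expression)}
            for key, expression in self.expressions_to_test.items()
        }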

    def describe_sample(self):
        only_digits = [item for item in self.samples if all(char.isdigit() for char in item)]
        return only_digits

    def test_expressions(self):

        def regex_test(samples, expr):
            return [expr.search(item) for item in samples]

        for key, values in self.expressions_to_test.items():
            t = timeit.Timer(lambda: regex_test(self.samples, values["compiled"]))

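            # note: the timer runs a fixed 100 repetitions; self.times is only
            # echoed in the "Times" field of the output
            print("{key}, Times: {times}, Result: {result}".format(key=key,
                                                                   times=self.times,
                                                                   result=t.timeit(100)))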


rt = RegexTester(sample_size=10 ** 5, word_size=10, times=10 ** 4)
#rt.describe_sample()
rt.test_expressions()

For a sample size of 10^5 and a word size of 10, this gave comparable results for both expressions:

Cary, Times: 10000, Result: 6.1406331
Jan, Times: 10000, Result: 5.948537699999999

With a sample size of 10^4 and a word size of 10^3, the two are again comparable:

Cary, Times: 10000, Result: 10.1723557
Jan, Times: 10000, Result: 9.697761900000001

The differences become significant when the strings consist only of digits (that is, when the samples are generated from digits alone):

Cary, Times: 10000, Result: 25.4842013
Jan, Times: 10000, Result: 17.3708319

Note that this is randomly generated text and, because of how it is generated, the longer the strings are, the less likely they are to consist only of digits. In the end it will depend on the actual text inputs.
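
The digits-only case can be reproduced by drawing the samples from `string.digits` alone. Below is a scaled-down sketch with much smaller sizes than in the reported runs; `make_samples` and `time_pattern` are hypothetical helpers and not part of the class above, and the timings will vary by machine.

```python
import random
import re
import string
import timeit


def make_samples(alphabet, sample_size, word_size):
    """Generate sample_size random strings of length word_size from alphabet."""
    return ["".join(random.choices(alphabet, k=word_size))
            for _ in range(sample_size)]


def time_pattern(pattern, samples, repetitions=10):
    """Time `repetitions` passes of pattern.search over all samples."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: [compiled.search(s) for s in samples],
                         number=repetitions)


# Digits-only samples: every string is rejected by both expressions.
digit_samples = make_samples(string.digits, sample_size=1000, word_size=10)

for name, pattern in {"Cary": r"^(?=.*\D)\w+$", "Jan": r"^(?!\d+$)\w+$"}.items():
    print(name, time_pattern(pattern, digit_samples))
```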

@CarySwoveland: It would indeed but `\d+` would probably stop earlier than dot-star and then backtrack. — Jan, May 20 '23 at 22:21
I expect which is faster would depend on the text. If the string were long and contained few digits it generally would not take long to find a non-digit with `^(?=.*\D)...`. Yours matches empty strings as well if you write `^(?!\d+$)\w*$`. — Cary Swoveland, May 21 '23 at 00:25
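
The point about empty strings in the last comment can be checked with a short sketch (the variable names are illustrative):

```python
import re

plus_version = re.compile(r"^(?!\d+$)\w+$")  # requires at least one character
star_version = re.compile(r"^(?!\d+$)\w*$")  # also accepts the empty string

for text in ["", "123", "abc123"]:
    print(repr(text),
          plus_version.search(text) is not None,
          star_version.search(text) is not None)
```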

score 2 · Answer 2 · answered May 20 '23 at 19:55

Another solution: simply search for any character other than a digit in your string:

import re

data = [
'expression123',
'123expression',
'exp123ression',
'1235234567544'
]

for t in data:
    m = re.search(r'\D', t)
    if m:
        print(t)

Prints:

expression123
123expression
exp123ression
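
One thing to keep in mind with this approach is that `\D` matches any non-digit, not only letters. The sketch below, with a hypothetical helper `has_non_digit` and a few extra samples that are not from the question, shows which inputs pass.

```python
import re


def has_non_digit(text):
    """Return True if text contains at least one non-digit character."""
    return re.search(r"\D", text) is not None


# Punctuation and whitespace count as non-digits; the empty string has none.
for text in ["exp123", "12 34", "$100", "", "123"]:
    print(repr(text), has_non_digit(text))
```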

Cary Swoveland · Answer 3 · 2023-05-21T06:15:41.980

You may attempt to match the following regular expression.

^(?:\w*[a-zA-Z_]\w*)?$

Demo

This matches empty strings. If the string must contain at least one character this can be simplified to

^\w*[a-zA-Z_]\w*$
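
A short sketch comparing the two expressions on the samples from the question plus the empty string (the variable names are illustrative):

```python
import re

allow_empty = re.compile(r"^(?:\w*[a-zA-Z_]\w*)?$")
require_char = re.compile(r"^\w*[a-zA-Z_]\w*$")

for text in ["expression123", "123expression", "exp123ression",
             "1235234567544", ""]:
    print(repr(text),
          allow_empty.search(text) is not None,
          require_char.search(text) is not None)
```

Because the required character is `[a-zA-Z_]`, a string such as `é1` is rejected by both expressions even though it contains a non-digit word character.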

edited May 21 '23 at 06:15

answered May 20 '23 at 21:49


You could prevent some backtracking matching optional digits first like `^\d*[a-zA-Z_]\w*$` https://regex101.com/r/N6K5yt/1 – The fourth bird May 21 '23 at 08:32
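
A rough check, under the assumption of ASCII-only input, that the variant from this comment accepts the same strings as the expression in the answer. The random samples and variable names are illustrative.

```python
import random
import re
import string

original = re.compile(r"^\w*[a-zA-Z_]\w*$")
digits_first = re.compile(r"^\d*[a-zA-Z_]\w*$")

# Random ASCII word-character strings of length 0 to 8, plus a few fixed cases.
random.seed(0)
alphabet = string.ascii_letters + string.digits + "_"
samples = ["".join(random.choices(alphabet, k=random.randint(0, 8)))
           for _ in range(1000)]
samples += ["", "123", "123abc", "abc123"]

agree = all((original.search(s) is None) == (digits_first.search(s) is None)
            for s in samples)
print(agree)
```

For non-ASCII word characters the two can differ: `é_` is matched by the first expression but not by the second, because `é` is neither a digit nor in `[a-zA-Z_]`.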

The fourth bird · Answer 4 · 2023-05-21T09:22:11.613

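Note that [A-z] matches more than [A-Za-z]: it also covers the six ASCII characters that sit between Z and a (the characters [ \ ] ^ _ and the backtick).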
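
A quick way to see the difference is to list every character in the range from A to z that `[A-z]` accepts but `[A-Za-z]` does not (a sketch; the variable name is illustrative):

```python
import re

# Walk the code points from "A" to "z" and collect the extra matches.
extra = [
    chr(code)
    for code in range(ord("A"), ord("z") + 1)
    if re.fullmatch(r"[A-z]", chr(code))
    and not re.fullmatch(r"[A-Za-z]", chr(code))
]
print(extra)  # ['[', '\\', ']', '^', '_', '`']
```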

If you want to check for alnum and not only digits in Python 3:

strings = [
    "expression123",
    "123expression",
    "exp123ression",
    "1235234567544",
]

for s in strings:
    if not s.isnumeric() and s.isalnum():
        print(s)

Output

expression123
123expression
exp123ression

Note that both .isnumeric() and .isalnum() are Unicode-aware, so they also accept digits and letters from non-ASCII scripts.
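
A small sketch of what Unicode awareness means in practice. The helper `accepted` and its `ascii_only` switch are hypothetical additions for illustration; `str.isascii()` requires Python 3.7 or later.

```python
def accepted(text, ascii_only=False):
    """Alphanumeric and not purely numeric; optionally restricted to ASCII."""
    if ascii_only and not text.isascii():
        return False
    return not text.isnumeric() and text.isalnum()


# Arabic-Indic digits, a vulgar fraction, and accented letters.
for text in ["abc123", "١٢٣", "½", "abc½", "été2023"]:
    print(text, text.isnumeric(), text.isalnum(),
          accepted(text), accepted(text, ascii_only=True))
```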

edited May 21 '23 at 09:22

answered May 21 '23 at 09:04


score 0 · Answer 5 · answered May 21 '23 at 08:27

Try this:

import re

regex = r'^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$'

strings = ['expression123', '123expression', 'exp123ression', '1235234567544']

for string in strings:
    if re.match(regex, string):
        print(f'Matched: {string}')
    else:
        print(f'Not matched: {string}')

This would give:

Matched: expression123
Matched: 123expression
Matched: exp123ression
Not matched: 1235234567544

answered May 21 '23 at 08:27

Aftab Udaipurwala


Your regex requires input string to also contain at least one digit. Why? – markalex May 21 '23 at 09:46
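
The effect described in this comment can be seen with a letters-only string. In the sketch below, `letter_required` is a hypothetical variation that drops the digit lookahead; it is not part of the answer.

```python
import re

# The expression from the answer: needs at least one letter AND one digit.
both_required = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$")
# Hypothetical variation: only the letter is required.
letter_required = re.compile(r"^(?=.*[A-Za-z])[A-Za-z\d]+$")

for text in ["expression", "expression123", "1235234567544"]:
    print(text,
          both_required.match(text) is not None,
          letter_required.match(text) is not None)
```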
