
I want to write a regex that matches a string that may contain both words and digits, but not a string that consists of digits only.

I used this regex [A-z+\d*], but it does not work.

Some matched samples:

expression123
123expression
exp123ression

Not matched sample:

1235234567544

Can you help me with this one? Thank you in advance

lemon
rasool
  • "...that may contain both words and digits..." is unclear because of "may" (e.g., could the string contain a dollar sign?), the use of "words" as opposed to "word characters" and the fact that a digit is a word character. Also, is an empty string to be matched? Please clarify by editing your question. – Cary Swoveland May 20 '23 at 21:42
  • Remove quantifiers from inside the class, convert characters to lowercase and use the case insensitivity flag if necessary, quantify the class as a whole, anchor the expression to the line in its entirety and use a lookahead to exclude lines that only contain digits and you'll get something like `^(?!\d+$)[a-z\d]+$`. – oriberu May 21 '23 at 08:08
  • If you clarify your question I will remove my downvote. – Cary Swoveland May 21 '23 at 15:32

5 Answers


Lookarounds to the rescue!

^(?!\d+$)\w+$

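This uses a negative lookahead and anchors: `(?!\d+$)` rejects strings that consist of digits only, and `\w+` then matches one or more word characters up to the end of the string. See a demo on regex101.com.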
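
As a minimal sketch of how the expression could be applied from Python (the helper name `is_valid` is made up for illustration and is not part of the original answer):

```python
import re

# (?!\d+$) fails if the whole string is digits only;
# \w+ then requires one or more word characters up to the end.
PATTERN = re.compile(r"^(?!\d+$)\w+$")


def is_valid(text):
    """Return True if text is made of word characters but is not digits only."""
    return PATTERN.search(text) is not None


samples = ["expression123", "123expression", "exp123ression", "1235234567544"]
matched = [s for s in samples if is_valid(s)]
print(matched)
```

Because `\w+` needs at least one character, the empty string is rejected as well.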


Note that you could get the same result for these samples with plain Python alone:

samples = ["expression123", "123expression", "exp123ression", "1235234567544"]
 
filtered = [item for item in samples if not item.isdigit()]
print(filtered)

# ['expression123', '123expression', 'exp123ression']

See another demo on ideone.com.

Note that the two approaches differ on inputs such as -1 or 1.0: the isdigit() filter keeps them, whereas the regex rejects them because - and . are not word characters.


Tests

Since the question of speed came up in the comments, here is a small benchmark for different sample sizes and expressions:

import random
import re
import string
import timeit


class RegexTester:
    # raw strings keep the backslash sequences (\D, \w, \d) intact
    expressions_to_test = {"Cary": r"^(?=.*\D)\w+$",
                           "Jan": r"^(?!\d+$)\w+$"}

    def __init__(self, sample_size=100, word_size=10, times=100):
        self.sample_size = sample_size
        self.word_size = word_size
        self.times = times

        # generate random alphanumeric samples
        self.samples = ["".join(random.choices(string.ascii_letters + string.digits, k=self.word_size))
                        for _ in range(self.sample_size)]

        # compile the expressions into a per-instance dict so that the
        # class-level dict is not overwritten when a second tester is created
        self.expressions = {key: {"raw": expression, "compiled": re.compile(expression)}
                            for key, expression in self.expressions_to_test.items()}

    def describe_sample(self):
        # return the samples that consist of digits only
        only_digits = [item for item in self.samples if all(char.isdigit() for char in item)]
        return only_digits

    def test_expressions(self):

        def regex_test(samples, expr):
            return [expr.search(item) for item in samples]

        for key, values in self.expressions.items():
            t = timeit.Timer(lambda: regex_test(self.samples, values["compiled"]))

            # note: the timer is hard-coded to 100 runs; self.times is only printed
            print("{key}, Times: {times}, Result: {result}".format(key=key,
                                                                   times=self.times,
                                                                   result=t.timeit(100)))


rt = RegexTester(sample_size=10 ** 5, word_size=10, times=10 ** 4)
#rt.describe_sample()
rt.test_expressions()

For a sample size of 10^5 and a word size of 10, this gave comparable results for both expressions:

Cary, Times: 10000, Result: 6.1406331
Jan, Times: 10000, Result: 5.948537699999999

When you set the sample size to 10^4 and the word size to 10^3, the results are again comparable:

Cary, Times: 10000, Result: 10.1723557
Jan, Times: 10000, Result: 9.697761900000001

You'll get significant differences when the strings consist of digits only (i.e., the samples are generated from digits alone):

Cary, Times: 10000, Result: 25.4842013
Jan, Times: 10000, Result: 17.3708319

Note that the text is randomly generated, so the longer the strings are, the less likely they are to consist of digits only. In the end, the timings will depend on the actual inputs.
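
The digit-only case could be reproduced by drawing the samples from `string.digits` alone. The following is only a sketch with small sizes; `make_samples` is a hypothetical helper, not part of the test suite above, and no timings are claimed for it:

```python
import random
import re
import string


def make_samples(alphabet, sample_size, word_size):
    """Hypothetical helper: build random strings from the given alphabet."""
    return ["".join(random.choices(alphabet, k=word_size))
            for _ in range(sample_size)]


digit_only = make_samples(string.digits, sample_size=1000, word_size=10)

cary = re.compile(r"^(?=.*\D)\w+$")
jan = re.compile(r"^(?!\d+$)\w+$")

# Neither expression should match a string that consists of digits only.
print(any(cary.search(s) for s in digit_only))
print(any(jan.search(s) for s in digit_only))
```

Every sample here is a non-match, which is the case where the two expressions have to do the most work before giving up.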

Jan
  • `^(?=.*\D)\w+$` also works. – Cary Swoveland May 20 '23 at 21:51
  • @CarySwoveland: It would indeed but `\d+` would probably stop earlier than dot-star and then backtrack. – Jan May 20 '23 at 22:21
  • I expect which is faster would depend on the text. If the string were long and contained few digits it generally would not take long to find a non-digit with `^(?=.*\D)...`. Yours matches empty strings as well if you write `^(?!\d+$)\w*$`. – Cary Swoveland May 21 '23 at 00:25
  • @CarySwoveland: Added some test thoughts. – Jan May 21 '23 at 08:07

Another solution: simply search for any non-digit character in your string:

import re

data = [
'expression123',
'123expression',
'exp123ression',
'1235234567544'
]

for t in data:
    m = re.search(r'\D', t)
    if m:
        print(t)

Prints:

expression123
123expression
exp123ression
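
One thing to keep in mind with this check is that any non-digit character is enough, not only letters. A small sketch with a few extra inputs (the helper `has_non_digit` and the extra samples are assumptions added for illustration):

```python
import re


def has_non_digit(text):
    """Return True if text contains at least one non-digit character."""
    return re.search(r"\D", text) is not None


extra_samples = ["1235234567544", "12 34", "-1", "1.0", ""]
accepted = [s for s in extra_samples if has_non_digit(s)]
print(accepted)
```

If spaces or punctuation should be rejected too, one of the anchored patterns from the other answers may be a better fit.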
Andrej Kesely

You may attempt to match the following regular expression.

^(?:\w*[a-zA-Z_]\w*)?$

Demo

This also matches the empty string. If the string must contain at least one character, it can be simplified to

^\w*[a-zA-Z_]\w*$
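
A short sketch comparing the two variants in Python; the constant names `ALLOW_EMPTY` and `NON_EMPTY` are made up for this example:

```python
import re

# The whole group is optional, so the empty string matches.
ALLOW_EMPTY = re.compile(r"^(?:\w*[a-zA-Z_]\w*)?$")
# At least one ASCII letter or underscore is required.
NON_EMPTY = re.compile(r"^\w*[a-zA-Z_]\w*$")

samples = ["expression123", "123expression", "exp123ression", "1235234567544", ""]
results = {s: (bool(ALLOW_EMPTY.search(s)), bool(NON_EMPTY.search(s)))
           for s in samples}

for s, flags in results.items():
    print(repr(s), flags)
```

Since `_` is part of the character class, a string such as `___` is accepted by both variants.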
Cary Swoveland

Note that [A-z] matches more than [A-Za-z]: it also matches the characters [, \, ], ^, _ and `, which lie between Z and a in ASCII.
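
The difference between the two ranges can be checked directly. This is just an illustrative sketch; the variable names are arbitrary:

```python
import re

# Code points 91 to 96 sit between "Z" (90) and "a" (97) in ASCII.
extra = [chr(code) for code in range(ord("Z") + 1, ord("a"))]
print(extra)

wide = re.compile(r"[A-z]")
strict = re.compile(r"[A-Za-z]")

# Every one of the extra characters matches the wide range ...
print([c for c in extra if wide.fullmatch(c)])
# ... and none of them matches the strict one.
print([c for c in extra if strict.fullmatch(c)])
```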

If you want to check that a string is alphanumeric but does not consist of digits only, in Python 3:

strings = [
    "expression123",
    "123expression",
    "exp123ression",
    "1235234567544",
]

for s in strings:
    if not s.isnumeric() and s.isalnum():
        print(s)

Output

expression123
123expression
exp123ression

Note that both .isnumeric() and .isalnum() are Unicode-aware.
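
A few non-ASCII inputs show what that means in practice. The sample strings below are assumptions chosen for illustration:

```python
# Arabic-Indic digits count as numeric, just like ASCII digits.
arabic_digits = "\u0663\u0664"   # "٣٤"
# The vulgar fraction one half is numeric but not a digit.
one_half = "\u00bd"              # "½"
# Accented letters count as alphanumeric.
accented = "caf\u00e91"          # "café1"

for s in (arabic_digits, one_half, accented):
    print(repr(s), s.isnumeric(), s.isalnum(), s.isdigit())
```

If only ASCII input should be accepted, combining the check with `s.isascii()` (Python 3.7+) is one option.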

The fourth bird

Try this:

import re

regex = r'^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$'

strings = ['expression123', '123expression', 'exp123ression', '1235234567544']

for string in strings:
    if re.match(regex, string):
        print(f'Matched: {string}')
    else:
        print(f'Not matched: {string}')

This would give:

Matched: expression123
Matched: 123expression
Matched: exp123ression
Not matched: 1235234567544
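
Note that this pattern requires at least one letter and at least one digit, so it is stricter than "not digits only". A quick sketch with a few extra inputs (the extra samples are assumptions added for illustration):

```python
import re

# Same pattern as above: both lookaheads must succeed.
regex = r'^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$'

extra_samples = ['expression', 'exp_123', 'expression123']
results = {s: bool(re.match(regex, s)) for s in extra_samples}
print(results)
```

A letters-only string fails the digit lookahead, and the underscore is not part of the character class.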