4

I want a regular expression that returns only the digits that occur inside a word, but I can only find expressions that return all digits in a string.

I've used this example: text = 'I need this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'

The following code returns all digits, but I am only interested in ['5', '3', '4']:

import re
print(re.findall(r'\d+', text))

Any suggestions?

Barbaros Özhan
Kiri

2 Answers

1

You can use

re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)

This regex extracts every run of one or more digits that is immediately preceded or immediately followed by an ASCII letter.
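As a quick check, here is a minimal sketch that runs the pattern above against the sample text from the question. The variable names `pattern` and `found` are only illustrative.

```python
import re

text = ('I need this number inside my wor5d, but also this word3 '
        'and this 4word, but not this 1 and not this 555.')

# Alternative 1: digits right after an ASCII letter (lookbehind).
# Alternative 2: digits right before an ASCII letter (lookahead).
pattern = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')

found = pattern.findall(text)
print(found)  # ['5', '3', '4']
```

The standalone `1` and `555` are skipped because neither neighbour is a letter.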

A fully Unicode version for Python re would look like

(?<=[^\W\d_])\d+|\d+(?=[^\W\d_])

where [^\W\d_] matches any Unicode letter.
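To illustrate the difference between the two patterns, the sketch below compares them on a made-up sample containing non-ASCII letters. The sample string and variable names are assumptions for illustration, not part of the question.

```python
import re

ascii_pattern = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')
unicode_pattern = re.compile(r'(?<=[^\W\d_])\d+|\d+(?=[^\W\d_])')

# Hypothetical sample: "café5", "7über", a digit between underscores,
# and a standalone number.
sample = 'caf\u00e95 7\u00fcber _8_ 42'

ascii_found = ascii_pattern.findall(sample)
unicode_found = unicode_pattern.findall(sample)

print(ascii_found)    # []
print(unicode_found)  # ['5', '7']
```

The class `[^\W\d_]` starts from `\w` and subtracts digits and the underscore, which is why `_8_` is not matched by either pattern.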

See the regex demo for reference.

Wiktor Stribiżew
  • 1
    [^\W\d_] doesn't exactly match any Unicode letter. It isn't based on, and doesn't use, the Unicode definition of \w or \W. A Unicode-compliant version of \w includes the characters in \p{gc=Mark}, while the re module puts them in \W instead. Compare with the regex module, which has a more Unicode-compliant implementation of \w and \W. The Python documentation rarely indicates where it differs from Unicode. – Andj Mar 16 '23 at 00:04
  • @Andj See [Match any unicode letter?](https://stackoverflow.com/a/6314634/3832970) – Wiktor Stribiżew Mar 16 '23 at 00:15
  • 1
    @wiktor_stribiżew, the example you link to doesn't use anything that would match \p{gc=Mark}. Compile a pattern `pattern = re.compile(r'[^\w]', re.U)`, then try `re.sub(pattern, "", 'Stribiżew')`, and then `re.sub(pattern, "", unicodedata.normalize("NFD",'Stribiżew'))`. The first will give you `Stribiżew`; the second will give you `Stribizew`, with the combining character stripped out. – Andj Mar 16 '23 at 01:57
  • 1
    A Unicode-compliant implementation of `\w` would match U+0307; the re module's does not. – Andj Mar 16 '23 at 02:00
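The point raised in these comments can be reproduced with a short sketch. It assumes the standard library `re` and `unicodedata` modules only; the test word `wor\u017c5` is a made-up example used to show the consequence for the digit pattern from the answer.

```python
import re
import unicodedata

nfc = 'Stribi\u017cew'                    # precomposed "ż" (U+017C)
nfd = unicodedata.normalize('NFD', nfc)   # "z" followed by combining U+0307

# re treats the combining mark as a non-word character, so it is removed.
stripped_nfc = re.sub(r'[^\w]', '', nfc)
stripped_nfd = re.sub(r'[^\w]', '', nfd)
print(stripped_nfc == nfc)  # True
print(stripped_nfd)         # Stribizew

# Consequence for the answer's pattern: a digit that follows a decomposed
# letter is preceded by the mark, not by a letter, so the lookbehind fails.
unicode_pattern = re.compile(r'(?<=[^\W\d_])\d+|\d+(?=[^\W\d_])')
digits_nfc = unicode_pattern.findall('wor\u017c5')
digits_nfd = unicode_pattern.findall(unicodedata.normalize('NFD', 'wor\u017c5'))
print(digits_nfc)  # ['5']
print(digits_nfd)  # []
```

If the input may contain decomposed text, normalizing it to NFC before matching is one possible workaround, though it only helps for characters that have a precomposed form.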
-1

An approach with str.translate, without the use of regex or re module:

from string import ascii_letters

delete_dict = {sp_character: '' for sp_character in ascii_letters}
table = str.maketrans(delete_dict)

text = 'I 77! need 1:5 this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'

print([res for s in text.rstrip('.').split()
       if not (s2 := s.rstrip(',')).isnumeric() and (res := s2.translate(table)) and res.isnumeric()])

Out:

['5', '3', '4']
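A small sketch of what the translation table does on individual tokens may help here. It uses the three-argument form of `str.maketrans`, which should behave the same as the dictionary above, and the token `ab12cd34` is a made-up example that shows where this approach and the regex approach can differ.

```python
import re
from string import ascii_letters

# Three-argument form: map nothing, delete every ASCII letter.
table = str.maketrans('', '', ascii_letters)

single = 'wor5d'.translate(table)      # letters removed, digit kept
merged = 'ab12cd34'.translate(table)   # two digit runs collapse into one
regex_runs = re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', 'ab12cd34')

print(single)      # 5
print(merged)      # 1234
print(regex_runs)  # ['12', '34']
```

So the two approaches agree on the question's sample text, but not necessarily on words that contain more than one run of digits.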

Performance

I was curious, so I ran some benchmarks to compare performance against the other approaches. On this input, str.translate is faster than even the precompiled regex (about 0.37 s versus 0.51 s for 100,000 runs).

Here is my benchmark code with timeit:

import re
from string import ascii_letters
from timeit import timeit


_NUM_RE = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')

delete_dict = {sp_character: '' for sp_character in ascii_letters}
_TABLE = str.maketrans(delete_dict)

text = 'I need this number inside my wor5d, but also this word3 and this 4word, but not this 1 and not this 555.'


def main():
    n = 100_000

    # The statement is a raw string so that \d is not read as an invalid escape sequence.
    print('regex:         ', timeit(r"re.findall(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])', text)",
                 globals=globals(), number=n))

    print('regex (opt):   ', (timeit("_NUM_RE.findall(text)",
                 globals=globals(), number=n)))

    print('iter_char:     ', timeit("""
k=set()
for x in range(1,len(text)-1):
    if text[x-1].isdigit() and text[x].isalpha():
        k.add(text[x-1])
    if text[x].isdigit() and text[x+1].isalpha():
        k.add(text[x])
    if text[x-1].isalpha() and text[x].isdigit() and text[x+1].isalpha():
        k.add(text[x])
    if text[x-1].isalpha() and text[x].isdigit():
        k.add(text[x])
    """, globals=globals(), number=n))

    print('str.translate: ', timeit("""
[
    res for s in text.rstrip('.').split()
    if not (s2 := s.rstrip(',')).isnumeric() and (res := s2.translate(_TABLE)) and res.isnumeric()
]
    """, globals=globals(), number=n))


if __name__ == '__main__':
    main()

Results (Mac OS X - M1, total seconds for 100,000 runs):

regex:          0.5315765410050517
regex (opt):    0.5069837079936406
iter_char:      2.5037198749923846
str.translate:  0.37348733299586456
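Single timeit runs can be noisy, so one option is to repeat each measurement and keep the minimum. The sketch below does this for the precompiled regex only; the repeat count and the number of calls per round are arbitrary choices, and the lambda adds a small call overhead that the string-based statements above do not have.

```python
import re
from timeit import repeat

text = ('I need this number inside my wor5d, but also this word3 '
        'and this 4word, but not this 1 and not this 555.')
_NUM_RE = re.compile(r'(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])')

# repeat() returns one total per round; the minimum is usually the least noisy.
timings = repeat(lambda: _NUM_RE.findall(text), repeat=5, number=10_000)
best = min(timings)
print(f'regex (opt), best of 5: {best:.4f} s per 10,000 calls')
```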
rv.kvetch