Letter digit letter sequence not being detected

Question

I'm trying to parse a text to figure out how many letter-digit-letter sequences there are.

Consider the following string: a123123aas52342ooo345345ooo

I used the following regex:

re.findall(r"[a-zA-Z]+\d+[a-zA-Z]+", string)

The sequences that should be detected are:

a123123aas
aas52342ooo
ooo345345ooo

However, this is what I'm getting:

a123123aas
ooo345345ooo

What am I doing wrong? I have a feeling that regex might not be the solution to this problem. Any suggestions?

@Barmar on second thought it might not be an exact duplicate as suggested solution will give `aas...`, `as....`, and `s...` as results and OP only interested in `aas...` — Tomerikoo, Jul 12 '19 at 20:03
@Barmar, yeah I just had a look at the duplicate solution. Didn't work on this one lol. Any thoughts on a potential solution? — Sam, Jul 12 '19 at 20:07
You need to define your requirements more carefully then. Why are some overlapping sequences OK, but not others? — Barmar, Jul 12 '19 at 20:08
See if this gets you closer: https://regex101.com/r/LHlQ5G/1 — acdcjunior, Jul 12 '19 at 20:12

score 0 · Answer 1 · answered Jul 12 '19 at 20:38

This simple expression, or a slightly modified version of it, might work on the input strings here:

[a-zA-Z]+\d+[a-zA-Z]+$|[a-zA-Z]+\d+

Test with re.findall

import re

regex = r"[a-zA-Z]+\d+[a-zA-Z]+$|[a-zA-Z]+\d+"

test_str = "a123123aas52342ooo345345ooo"

print(re.findall(regex, test_str))

Output

['a123123', 'aas52342', 'ooo345345ooo']

Test with `re.finditer`

import re

regex = r"[a-zA-Z]+\d+[a-zA-Z]+$|[a-zA-Z]+\d+"

test_str = "a123123aas52342ooo345345ooo"

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # The pattern has no capture groups, so this loop only runs if groups are added.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

The expression is explained in the top right panel of this demo, if you wish to explore, simplify, or modify it, and in this link you can watch how it matches against some sample inputs step by step.

RegEx Circuit

jex.im visualizes regular expressions.

score 0 · Accepted Answer · answered Jul 12 '19 at 20:40

A little workaround on the "all overlapping matches" answer:

>>> import re
>>> s = "a123123aas52342ooo345345ooo"
>>> print(re.findall(r"(?<![a-zA-Z])(?=([a-zA-Z]+\d+[a-zA-Z]+))", s))
['a123123aas', 'aas52342ooo', 'ooo345345ooo']

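This basically says:

Look ahead, make sure the required pattern is there, and capture it. The lookahead itself consumes no characters, so a later match can reuse letters that an earlier one already covered.
The added lookbehind makes sure each match starts at the first letter of a letter sequence, so `as...` and `s...` are not reported.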
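To see why the plain pattern from the question skips the middle sequence, it may help to compare the spans each approach reports. This is only a sketch on the example string from the question; the variable names are illustrative.

```python
import re

s = "a123123aas52342ooo345345ooo"

# Plain pattern from the question: every match consumes its characters,
# so the scan resumes right after the previous match ends.
plain = re.compile(r"[a-zA-Z]+\d+[a-zA-Z]+")
plain_spans = [(m.start(), m.end(), m.group()) for m in plain.finditer(s)]

# Lookahead version: the match itself is zero-width and the text of
# interest sits in capture group 1, so spans are allowed to overlap.
lookahead = re.compile(r"(?<![a-zA-Z])(?=([a-zA-Z]+\d+[a-zA-Z]+))")
lookahead_spans = [(m.start(1), m.end(1), m.group(1)) for m in lookahead.finditer(s)]

print(plain_spans)
print(lookahead_spans)
```

The first plain match ends at index 10, after `aas`, so those letters are no longer available to start the next match. With the lookahead, the second span starts at index 7 and overlaps the first.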

A demo on your example string.
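Since the question asks how many sequences there are, the same pattern could be wrapped in a small helper. This is a minimal sketch; `letter_digit_letter` is a hypothetical name, not something from the `re` module.

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r"(?<![a-zA-Z])(?=([a-zA-Z]+\d+[a-zA-Z]+))")


def letter_digit_letter(text):
    """Return (start_index, sequence) pairs, overlapping matches included."""
    return [(m.start(1), m.group(1)) for m in PATTERN.finditer(text)]


found = letter_digit_letter("a123123aas52342ooo345345ooo")
print(len(found))  # number of sequences
for start, seq in found:
    print(start, seq)
```

Returning the start index alongside each sequence makes it easy to check where the overlaps occur.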
