I'm trying to parse a text to figure out how many letter-digit-letter sequences it contains.

Consider the following string: a123123aas52342ooo345345ooo

I used the following regex:

re.findall(r"[a-zA-Z]+\d+[a-zA-Z]+", string)

The sequences that should be detected are:

  • a123123aas
  • aas52342ooo
  • ooo345345ooo

However, this is what I'm getting:

  • a123123aas
  • ooo345345ooo

What am I doing wrong? I have a feeling that regex might not be the solution to this problem. Any suggestions?

Sam
  • `findall()` doesn't find overlapping sequences. – Barmar Jul 12 '19 at 19:53
  • @Barmar on second thought, it might not be an exact duplicate: the suggested solution will give `aas...`, `as....`, and `s...` as results, and the OP is only interested in `aas...` – Tomerikoo Jul 12 '19 at 20:03
  • @Barmar, yeah, I just had a look at the duplicate solution. It didn't work on this one lol. Any thoughts on a potential solution? – Sam Jul 12 '19 at 20:07
  • You need to define your requirements more carefully then. Why are some overlapping sequences OK, but not others? – Barmar Jul 12 '19 at 20:08
  • See if this gets you closer: https://regex101.com/r/LHlQ5G/1 – acdcjunior Jul 12 '19 at 20:12

2 Answers


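This expression, or a slightly modified version of it, yields one match per sequence on the sample input. It consumes the trailing letters only at the end of the string, so every match except the last one stops after the digits: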

[a-zA-Z]+\d+[a-zA-Z]+$|[a-zA-Z]+\d+

Test with re.findall

import re

regex = r"[a-zA-Z]+\d+[a-zA-Z]+$|[a-zA-Z]+\d+"

test_str = "a123123aas52342ooo345345ooo"

print(re.findall(regex, test_str))

Output

['a123123', 'aas52342', 'ooo345345ooo']
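
Since the question asks how many sequences there are, the pattern above can be wrapped in a small counting helper. This is only a sketch: the name `count_sequences` is hypothetical and not part of the answer. It also inherits a limitation of the pattern, which matches a letter-digit run even when no letters follow it, so an input such as `ab12` is counted once.

```python
import re

# Pattern from the answer above: trailing letters are consumed only at the end
# of the string, so they remain available as the leading letters of the next match.
PATTERN = re.compile(r"[a-zA-Z]+\d+[a-zA-Z]+$|[a-zA-Z]+\d+")


def count_sequences(text):
    """Hypothetical helper: count the matches of PATTERN in text."""
    return len(PATTERN.findall(text))


print(count_sequences("a123123aas52342ooo345345ooo"))  # 3
```

If the matched strings themselves are needed with their trailing letters, this pattern is not sufficient on its own.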

Test with re.finditer

import re

regex = r"[a-zA-Z]+\d+[a-zA-Z]+$|[a-zA-Z]+\d+"

test_str = "a123123aas52342ooo345345ooo"

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    # Report each full match together with its span in the input string.
    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

    # The pattern has no capture groups, so this loop does not run here;
    # it is kept for patterns that do define groups.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.

RegEx Circuit

jex.im visualizes regular expressions (the diagram image is not reproduced here).

Emma

A little workaround on the "all overlapping matches" answer:

>>> import re
>>> s = "a123123aas52342ooo345345ooo"
>>> print(re.findall(r"(?<![a-zA-Z])(?=([a-zA-Z]+\d+[a-zA-Z]+))", s))
['a123123aas', 'aas52342ooo', 'ooo345345ooo']

This basically says:

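  • The lookahead checks that the required pattern starts at the current position and captures it without consuming any characters, so later matches can overlap it.
  • The added negative lookbehind makes sure the match starts at the first letter of each run of letters, which rules out partial results such as `as52342ooo`.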

A demo on your example string.
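
The lookahead approach can also be wrapped in a reusable function. The following is a minimal sketch, assuming ASCII letters only as in the question; the name `find_sequences` is hypothetical and used purely for illustration.

```python
import re

# Zero-width match: the lookbehind requires that no letter precedes the position,
# and the lookahead captures the whole letter-digit-letter sequence without
# consuming it, so neighbouring sequences may share their letters.
PATTERN = re.compile(r"(?<![a-zA-Z])(?=([a-zA-Z]+\d+[a-zA-Z]+))")


def find_sequences(text):
    """Hypothetical helper: return all (possibly overlapping) sequences."""
    return [m.group(1) for m in PATTERN.finditer(text)]


sequences = find_sequences("a123123aas52342ooo345345ooo")
print(sequences)       # ['a123123aas', 'aas52342ooo', 'ooo345345ooo']
print(len(sequences))  # 3
```

Taking `len()` of the result gives the count asked for in the question, and a run with no letters after the digits (for example `ab12`) yields no match.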

Tomerikoo