-1

I want to match sequences of one 1 followed by one or more 0s and one 1 (101, 1001, 10001, etc.) and the opposite as well switching the ones and zeros (010, 0110, 01110, etc.)

I can match separately with the regexes "10+1" and "01+0" just fine, but I can't seem to combine them to match either of them in a single regex:

import re
match1 = re.findall('10+1', '10100100011000001')
match2 = re.findall('01+0', '10100100011000001')
match12 = re.findall('10+1|01+0', '10100100011000001')
print(match1 + match2)
print(match12)

The first print with the combined matches from both regexes gets all the matches I’m looking for: ['101', '10001', '1000001', '010', '010', '0110'], while the second print with the matches from the unified regex is missing half of them: ['101', '010', '0110'].

Where am I messing up?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
  • 1
    I don't think you can combine them, (`match12`). The problem is when `101` first matches, then `010` can't match because you have already matched the `01` in the preceding match. This happens in all the matches where when one matches and the regex moves to the remaining characters, you are past the characters you need for the other part of the regex to match. – Chris Charley Aug 04 '23 at 02:27
  • See [How to use regex to find all overlapping matches](https://stackoverflow.com/questions/5616822/how-to-use-regex-to-find-all-overlapping-matches) – The fourth bird Aug 04 '23 at 14:14

2 Answers2

2

Getting overlapping "matches" is actually possible, partly thanks to a quirk of .findall():

If there is exactly one group, return a list of strings matching that group.

re.findall()re | Python documentation

The regex we need will look like this:

(?=       # Match a position (zero-length), followed by
  (10+1)  # '1', one or more '0's and another '1', captured in group 1.
)         #

Try it on regex101.com.

Replace this back to your snippet:

import re

text = '10100100011000001'
match1 = re.findall('(?=(10+1))', text)  # ['101', '1001', '10001', '1000001']

01+0 can also be matched this way:

match2 = re.findall('(?=(01+0))', text)  # ['010', '010', '0110']

(?=(10+1))|(?=(01+0)) will not work as one may expect, since there are two separate groups, not one:

If multiple groups are present, return a list of tuples of strings matching the groups.

re.findall()re | Python documentation

match12 = re.findall('(?=(10+1))|(?=(01+0))', text)
'''
[
  ('101', ''), ('', '010'), ('1001', ''), ('', '010'),
  ('10001', ''), ('', '0110'), ('1000001', '')
]
'''

However, putting the branches inside the first group does work:

match12 = re.findall('(?=(10+1|01+0))', text)
# ['101', '010', '1001', '010', '10001', '0110', '1000001']

Try it on regex101.com.

InSync
  • 4,851
  • 4
  • 8
  • 30
1

While regular expressions seem to be convenient, they are not capable of doing everything and if they do there is just so much complexity to it.

Your example provides the string "10100100011000001" and wants any sequence that starts with 1 has one or many repetitions of 0 and ends with 1 again.

So the logical answer is:

101 - 1001 - 10001 - 1000001

But the regex (not Python-specific) gives you:

101 - 10001 - 1000001

The reason is the trailing 1 in 101 is used up and the regex cannot identify that you may allow reusing characters.

10100100011000001
  |

That is also the reason you cannot combine these two expressions

To be honest, I don't know how one might use a regex to evaluate the pattern you are looking for, but I have a better solution "code it your self":

value = '10100100011000001'
found = []
sio = ''
siz = ''
for char in value:
    siz += char
    sio += char
    if char == '0':
        if '1' in siz and siz[0] == '0':
            found.append(siz)
        siz = '0'

    if char == '1':
        if '0' in sio and sio[0] == '1':
            found.append(sio)
        sio = '1'

print(found)

Also if you don't want repetition in matches, you can use sets instead of lists.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
  • That doesn't look very [Pythonic](https://en.wiktionary.org/wiki/Pythonic#Adjective), more something you would write in [C](https://en.wikipedia.org/wiki/C_%28programming_language%29). Isn't there another way? – Peter Mortensen Aug 04 '23 at 12:02
  • @PeterMortensen I tried the answer from InSync's it seems to work find though I have never used really fancy regex in production because I don't like to get my code unreadable to others or future me – X-_-FARZA_ D-_-X Aug 04 '23 at 13:11