3

I am facing some issues forming a regex that matches at least n times a given pattern within m characters of the input string. For example imagine that my input string is:

00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100

I want to detect all cases where an 1 appears at least 7 times (not necessarily consecutively) in the input string, but within a window of up to 20 characters.

So far I have built this expression:

(1[^1]*?){7,}

which detects all cases where an 1 appears at least 7 times in the input string, but this now matches both the:

11000000011101111

and the

1100000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011

parts whereas I want only the first one to be kept, as it is within a substring composed of less than 20 characters.

It tried to combine the aforementioned regex with:

(?=(^[01]{0,20}))

to also match only parts of the string containing either an '1' or a '0' of length up to 20 characters but when I do that it stops working.

Does anyone have an idea gow to accomplish this? I have put this example in regex101 as a quick reference.

Thank you very much!

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
dimly
  • 35
  • 1
  • 4
  • I wonder if this .NET regex (click table) would do: [`(?=((?:10*?){7}))(?=.{0,20}(?<=\1))`](http://regexstorm.net/tester?p=%28%3f%3d%28%28%3f%3a10*%3f%29%7b7%7d%29%29%28%3f%3d.%7b0%2c20%7d%28%3f%3c%3d%5c1%29%29&i=00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100) Just for fun:) – bobble bubble Nov 16 '19 at 15:26
  • This regex is of course nonsense for practical use but I like the regex riddles :) @Anonymous already put a nice answer capturing [overlapping matches](https://stackoverflow.com/questions/11430863/how-to-find-overlapping-matches-with-a-regexp) inside a lookahead and filtering. In .NET regex it's possible to use an already captured part of a previous group inside a lookbehind. The idea is - at the same starting point - after capturing the first part in a 2nd lookahead look back, if the captured part matches within zero to 20 characters from behind. – bobble bubble Nov 16 '19 at 15:45
  • Further worth to mention, that already in `11000000011101111` there are 5 possible matches where an `1` appears at least 7 times: `110000000111011`,`1100000001110111`,`11000000011101111`,`1000000011101111011`,`1000000011101111` None of the regexes will match all of those. It needs to be done on program side. Even capturing inside a lookahead can only match one match at each position in the string. That can be the shortest or longest. – bobble bubble Nov 16 '19 at 16:50
  • Thank you for the explanations @bobblebubble. Makes sense! Turns out the answer by anonymous did what I wanted and is now clearer to me that such a task cannot be done with regex alone - needs to be done also on the program side. Thank you both again! – dimly Nov 17 '19 at 16:08

2 Answers2

1

This is not something that can be done with regex without listing out every possible string. You would need to iterate over the string instead.

You could also iterate over the matches. Example in Python:

import re
matches = re.finditer(r'(?=((1[^1]*?){7}))', string)
matches = [match.group(1) for match in matches if len(match.group(1)) <= 20]
Anonymous
  • 11,748
  • 6
  • 35
  • 57
  • Thank you for your reply. You mean that I would have to keep the first expression I built, get all returned results and then loop over them to keep only the ones matching my second critrion (<20 characters), is that correct? I thought of the same but wanted to checked if maybe a better solution was available. – dimly Nov 16 '19 at 14:11
  • @dimly Yes, you can do that or not use regex at all. I added an example. Regex is not good at this kind of problem – Anonymous Nov 16 '19 at 14:15
1

The next Python snippet is an attempt to get the desired sequences using only the regular expression.

import re
r = r'''
(?mx)
(                 # the 1st capturing group will contain the desired sequence
1                 # this sequence should begin with 1
(?=(?:[01]{6,19}) # let's see that there are enough 0s and 1s in a line
   (.*$))         # the 2nd capturing group will contain all characters to the end of a line
(?:0*1){6})       # there must be six more 1s in the sequence
(?=.{0,13}        # complement the 1st capturing group to 20 characters
\2)               # the rest of a line should be 2nd capturing group
'''
s = '''
0000000
101010101010111111100000000000001
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
1111111
111111
'''
print([m.group(1) for m in re.finditer(r, s)])

Output:

['1010101010101', '11111100000000000001', '110000000111011', '1111111']

You can find an exhaustive explanation of this regular expression on RegEx101.

Andrei Odegov
  • 2,925
  • 2
  • 15
  • 21
  • Very slick solution @Andrei Odegov! I like it!! Thanks for taking the time to elaborate on it and for the detailed explanations! – dimly Nov 19 '19 at 09:56