Regex to get the size of capturing group

Question

Is is possible to write a regex where I can somehow refer the "length of the first capture group" later in the same regex? What I am trying to achieve here is to capture continuous occurrences of 1's that are followed by the exact number of continuous occurrences of 2's.

I want something like

r"(1*)(2{length(\1)})" # where `length(\1)` should give me the length of capture group 1

Should Match

1122 # two 1's followed by two 2's
111222 # three 1's followed by three 2's
121122111222 # should match `12` and `1122` and `111222` separately

Should Not Match

122 # there are two 2's following one 1
112 # there are two 1's but only one 2
11222 # same as above but with different occurrences
11122 # same as above but with different occurrences

I was thinking that might be the case, thought I could post a question to get it clarified — python_user, Oct 31 '20 at 16:23
Fixed alternations. Like 1, 2, 3, 4 _ones_ `(?:11112222|111222|1122|12)`, etc .. — , Oct 31 '20 at 16:27
ohh, there is no fixed range, but it doesn't go more than 50 continuous occurrences — python_user, Oct 31 '20 at 16:28
You can do it with the python `regex` engine. Its done with recursion. — , Oct 31 '20 at 16:29
No need for another answer, [here it is](https://stackoverflow.com/a/3644267/3832970). — Wiktor Stribiżew, Oct 31 '20 at 19:22
@WiktorStribiżew the dupe mentions how to do this in Java and PHP. I wanted to know how to do this in python — python_user, Nov 01 '20 at 04:24
thanks @anubhava, if you have a better solution do feel free to add it here — python_user, Nov 01 '20 at 04:47
@python_learner: I have posted an improvement on accepted answer in comments below — anubhava, Nov 01 '20 at 05:00
The PCRE solution that is posted at https://stackoverflow.com/a/3644267/3832970 is supported by PyPi regex library, so it is the same. — Wiktor Stribiżew, Nov 01 '20 at 10:27

score 1 · Accepted Answer · 2020-10-31T20:48:39.363

Update I guess you could go with some absurd Java lookahead recursion simulation that won't work
or you could use Python to do it ?

>>> import regex
>>> rx_1_2 = r"(?m)^(1(?>(?1))*2)$"
>>>
>>> input = '''
... 111222222
... 11222234
... 1111222
... 111222
... 1122
... 12
... '''
>>> res = regex.findall( rx_1_2, input )
>>> print( res )
['111222', '1122', '12']

That this question was marked a duplicate of a Java simulated recursion
using lookaheads is astoundingly bad judgment on whoever covered this question up by marking it a duplicate. Just plain poor judgment...

It can be done with pythons regex module.
Needs to use recursion.
Done this way because it is really just nested delimiters.

1(?>[^12]++|(?R))*2

https://regex101.com/r/4Nxtvl/1

                         # Recursion 
 1                       # 1
 (?>                     # Atomic group
      [^12]++                 # Possesive, not 1 or 2
   |                        # or,
      (?R)                    # Recurse the regex
 )*                      # End cluster, do 0 to many times
 2                       # 2

To not allow inner content use 1(?>(?R))*2 https://regex101.com/r/mSUIp0/1

To add boundary conditions, contain the recursion to a group,
then surround it with boundary constructs.

(?<!\d)(1(?>[^12]++|(?1))*2)(?!\d)

https://regex101.com/r/SSr1zV/1

 (?<! \d )               # Not a digit behind
 (                       # (1 start), Recursion code group
    1                       # 1
    (?>                     # Atomic group
       [^12]++                 # Possesive, not 1 or 2
     |                        # or,
       (?1)                    # Recurse the regex group 1
    )*                      # End cluster, do 0 to many times
    2                       # 2
 )                       # (1 end)
 (?! \d )                # Not a digit ahead

To not allow inner content use (?<!\d)(1(?>(?1))*2)(?!\d) https://regex101.com/r/VI6w0Y/1

thanks for answering, recursive regex is something that I am not aware of so +1 for that, but from the regex101 link it must not match any of your test inputs because the number of `2`'s are more than the number of `1`'s for the first two, and third has 4 `1`'s followed by 3 `2`'s. I want the "exact" same occurrence of `1`'s followed by `2`. I have given what not to match in the question — python_user, Oct 31 '20 at 16:39
@python_learner yeah thats just a simple adjustment. Add your boundary requirements as needed before and after. To do that, recurse a _group_ not the whole regex. I'lll put up a mod. It's moot if you don't want to use regex module though. — , Oct 31 '20 at 16:42
Marking this as an answer because this has proven to be useful to me, worth noting `(?m)^(1(?>(?1))*2)$` fails to match `121122111222` but that is something I believe I can tweak later, the dupe was in java and php which is not what this question asked for — python_user, Nov 01 '20 at 04:19

Neil · Answer 2 · 2020-10-31T16:44:14.220

If the largest number of sequential 1's is low enough then you can enumerate the options. something like:

r'(12)|(1122)|(1{3}2{3})' etc

You could even generate the regex. Long regexes can be surprisingly efficient if there isn't too much recursion.

for i in range(1:50):
    regex += r"|1{" + str(i) + r"}2{" + str(i) + r"}"

You'll have to add boundaries as required as well.

If you don't mind making two passes, you can get the length from the re match object:

ONES = re.compile(r'1+')
match = ONES.search(string)
if match is not None:
    length = match.end() - match.start()
TWOS = re.compile(r'2{' + str(length) + r'}')
string = string[match.end():]
match = TWOS.search(string)
...

If you're willing to not use regex, consider splitting into a list and using list operations

Regex to get the size of capturing group

2 Answers2

Linked