Reconstructing a string with a regex pattern and capturing groups

Question

Suppose I have a list of n values (parts in the example below) and a regular expression with n capturing groups (pattern).

import re

parts = ['a', 'b', 'c'] # An arbitrary set of values with `n = len(parts)`

pattern = r'^(\w+), (\w+) and also (\w+)!$' # A regular expression pattern with `n` capturing groups
regex = re.compile(pattern)

list(regex.match(unknown_string).groups()) == parts # Should evaluate to `True`

How do we find unknown_string? (Let's consider only cases where exactly one string satisfies this condition.)

Of course, we can see in this case that unknown_string = 'a, b and also c!'. However, I have not been able to find a robust method for reconstructing this string from only the captured values (parts) and the regular expression pattern (regex).
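To make the condition concrete, here is a minimal sketch of a checker for any candidate string. The helper name `satisfies` is hypothetical and only used for illustration; it relies on nothing beyond `re.match` and `Match.groups()`.

```python
import re

def satisfies(pattern, candidate, parts):
    """Return True if `candidate` matches `pattern` and its groups equal `parts`.

    `satisfies` is a hypothetical helper name used for illustration only.
    """
    match = re.match(pattern, candidate)
    # `groups()` returns a tuple, so convert both sides before comparing
    return match is not None and list(match.groups()) == list(parts)

pattern = r'^(\w+), (\w+) and also (\w+)!$'
print(satisfies(pattern, 'a, b and also c!', ['a', 'b', 'c']))  # True
print(satisfies(pattern, 'a, b and c!', ['a', 'b', 'c']))       # False (no match)
```

Any reconstruction method can be validated against this check before its output is trusted.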

A good solution would be able to handle complex syntax that includes features such as non-capturing groups, lookaround assertions, etc.

Attempt at a solution:

Let's assume that, given the list of parts and the regex, we have enough information to reconstruct the original string (i.e. there is a unique string which satisfies the conditions put forth). For this reason, let's restrict ourselves to regex patterns of the form '^.*$', i.e. patterns anchored at both the start and the end of the string.

In the simplest case, where there are no nested parentheses and all parentheses form capturing groups (as opposed to escaped parens \( and \), non-capturing groups, lookarounds, etc.), we could just do something simple like

def naive_reconstruct(pattern, parts):
    # Replace each innermost parenthesized group, left to right,
    # with the next element from `parts`
    for p in parts:
        pattern = re.sub(r'\([^\(\)]*\)', p, pattern, count=1)

    return pattern

However, this fails for anything more complex, such as the features mentioned above. Let's consider the pattern '^(\w(?!\d))_(\d{4})_(?:xxx(\d{2}))?$', which will match 'a_1234_' and 'a_1234_xxx01' but not 'a2_1234_xxx01'. (For our purposes, we can assume all values in parts are acceptable matches for the corresponding capturing group.)

If we had parts = ['b', '5678', '02'] we would expect to get out 'b_5678_xxx02', whereas if parts = ['b', '5678', None] the output should be 'b_5678_'. In both cases the simple solution fails.
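As a quick sanity check of these claims, the following sketch runs the pattern against the strings mentioned above and records the captured groups. The dictionary name `results` is only used for illustration.

```python
import re

pattern = r'^(\w(?!\d))_(\d{4})_(?:xxx(\d{2}))?$'

samples = ['a_1234_', 'a_1234_xxx01', 'a2_1234_xxx01', 'b_5678_xxx02', 'b_5678_']

# Map each sample to its captured groups, or None when the pattern does not match
results = {}
for s in samples:
    m = re.match(pattern, s)
    results[s] = m.groups() if m else None

for s, groups in results.items():
    print(repr(s), '->', groups)
# 'a_1234_' -> ('a', '1234', None)
# 'a_1234_xxx01' -> ('a', '1234', '01')
# 'a2_1234_xxx01' -> None
# 'b_5678_xxx02' -> ('b', '5678', '02')
# 'b_5678_' -> ('b', '5678', None)
```

Note that the optional non-capturing group leaves the third group as None when it does not participate in the match, which is why None has to be a legal value in parts.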

pattern = r'^(\w(?!\d))_(\d{4})_(?:xxx(\d{2}))?$'

naive_reconstruct(pattern, ['b', '5678', '02'])
# Returns '^5678_02_(?:xxx(\\d{2}))?$'

naive_reconstruct(pattern, ['b', '5678', None])
# Raises TypeError

One way to get around the artifacts left by non-capturing groups is by first removing them from the string. I managed to accomplish this for simple cases where there is no more than one set of nested parentheses inside of any non-capturing group or lookaround assertion:

def reconstruct(pattern, parts):

    # First, eliminate non-capturing groups, etc., from the string
    noncapturing = r'\(\?[:=!]([^\(\)]*?(?:\(.*?\))*?[^\(\)]*?)\)'
    reconstructed = re.compile(noncapturing).sub(r'\g<1>', pattern)

    # Now replace capturing groups with elements from `parts`
    for p in parts:
        reconstructed = re.sub(r'\(.*?\)', p, reconstructed, count=1)

    # Remove special characters, replace escaped characters
    reconstructed = reconstructed.replace('?','')
    reconstructed = reconstructed.replace(r'\.','.')

    # Strip string start & end anchors
    reconstructed = reconstructed.strip('^$')

    return reconstructed

This solution seems to work for simple cases, but it fails to generalize to patterns with deeper nesting of parentheses. It also depends on post hoc removal of residual special characters that were not substituted out earlier, which feels sloppy (although it is not a deal-breaker).

Obviously the regex parser can detect which parts of a string correspond to capturing groups. Is there some way to access or otherwise emulate this functionality in a fully generalizable way?
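One possible direction, sketched here under loud assumptions: CPython ships an internal regex parser (`re._parser` in Python 3.11+, `sre_parse` in earlier versions) whose parse tree marks capturing groups explicitly. It is a private, undocumented module and may change between versions. The names `reconstruct_from_tree`, `_build` and `_MissingGroup` are hypothetical. The sketch only handles literals, anchors, lookarounds, groups, optional pieces and fixed-count repeats, and raises ValueError for anything ambiguous outside a capturing group. An optional piece with no capturing group inside is emitted as written, which is a guess.

```python
import re

try:
    from re import _parser as sre_parser   # Python 3.11+ (private module)
except ImportError:
    import sre_parse as sre_parser         # earlier Python 3 versions


class _MissingGroup(Exception):
    """Raised when a capturing group's value in `parts` is None."""


def _build(tree, parts):
    out = []
    for op, av in tree:
        if op == sre_parser.LITERAL:
            out.append(chr(av))
        elif op in (sre_parser.AT, sre_parser.ASSERT, sre_parser.ASSERT_NOT):
            continue  # anchors and lookarounds consume no characters
        elif op == sre_parser.SUBPATTERN:
            group, sub = av[0], av[-1]  # group is None for non-capturing groups
            if group is None:
                out.append(_build(sub, parts))
            elif parts[group - 1] is None:
                raise _MissingGroup(group)
            else:
                out.append(parts[group - 1])  # use the captured value verbatim
        elif op in (sre_parser.MAX_REPEAT, sre_parser.MIN_REPEAT):
            lo, hi, sub = av
            if (lo, hi) == (0, 1):
                try:
                    out.append(_build(sub, parts))
                except _MissingGroup:
                    pass  # optional piece did not participate in the match
            elif lo == hi:
                out.append(_build(sub, parts) * lo)
            else:
                raise ValueError('ambiguous repeat outside a capturing group')
        else:
            raise ValueError('unsupported construct: %r' % (op,))
    return ''.join(out)


def reconstruct_from_tree(pattern, parts):
    candidate = _build(sre_parser.parse(pattern), parts)
    # Verify the candidate against the real regex engine before returning it
    match = re.match(pattern, candidate)
    if match is None or list(match.groups()) != list(parts):
        raise ValueError('could not reconstruct a matching string')
    return candidate


pattern = r'^(\w(?!\d))_(\d{4})_(?:xxx(\d{2}))?$'
print(reconstruct_from_tree(pattern, ['b', '5678', '02']))   # b_5678_xxx02
print(reconstruct_from_tree(pattern, ['b', '5678', None]))   # b_5678_
```

The final check against `re.match` is a deliberate design choice: the tree walk is only a heuristic, so the real engine gets the last word on whether the candidate is acceptable.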

Related:

Python Regex instantly replace groups addresses it but does not attempt to generalize to more complex regular expressions.

I think what you actually want is `str.format`. Or you can use the other method of formatting: `"%s, %s and %s" % mylist` — בנימין כהן, Jul 27 '20 at 08:07
Interesting problem. Just to understand - you don't have access to the 'unknown string' or to the resulting re.match object? — Roy2012, Jul 27 '20 at 08:13
Also - should the reconstruct function work for any regex, or just for the regex given above? — Roy2012, Jul 27 '20 at 08:17
Since your `pattern` did not have a `$` anchor nor did you use `matchall`, there are, in fact, an infinite number of possible values for `unknown_string` that would have resulted in `mylist`. Also, you want `list(regex.match(unknown_string).groups()) == mylist` because `all(regex.match(unknown_string) == mylist) ` generates `TypeError: 'bool' object is not iterable`. — Booboo, Jul 27 '20 at 13:54
@Booboo excellent point, I forgot to include those anchors when writing the question. I have edited it to include them. — corvus, Jul 27 '20 at 14:23
@Roy2012 yes, exactly. All I have is the pattern and a list of values corresponding to the capturing groups that would be returned by `regex.match(unknown_string).groups()`. Also I fixed a small typo in the original post (as @Booboo pointed out) that might have made this confusing. — corvus, Jul 27 '20 at 14:26
@בנימיןכהן There is a degree of preprocessing that is needed before `str.format` will work. For some regex patterns, this is very much non-trivial. — corvus, Jul 27 '20 at 14:28
