Suppose I have a list of n values (parts in the example below) and a regular expression with n capturing groups (pattern).
import re
parts = ['a', 'b', 'c']  # An arbitrary list of values with `n = len(parts)`
pattern = r'^(\w+), (\w+) and also (\w+)!$'  # A regular expression pattern with `n` capturing groups
regex = re.compile(pattern)
# `groups()` returns a tuple, so compare against `tuple(parts)`
regex.match(unknown_string).groups() == tuple(parts)  # Should evaluate to `True`
How do we find unknown_string? (Let's consider only cases where there is a single string that satisfies the condition.)
Of course, we can see in this case that unknown_string = 'a, b and also c!'; however, I have not been able to find a robust method for reconstructing this string from only the captured values (parts) and the regular expression pattern (regex).
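To make the condition concrete, here is a minimal sketch of a checker for a candidate string. The helper name `satisfies` is hypothetical (it is not part of `re`); it only uses `re.compile`, `match` and `groups()`.

```python
import re


def satisfies(pattern, candidate, parts):
    """Return True if `candidate` matches `pattern` and its groups equal `parts`."""
    match = re.compile(pattern).match(candidate)
    # `groups()` returns a tuple, so convert `parts` before comparing
    return match is not None and match.groups() == tuple(parts)


example_pattern = r'^(\w+), (\w+) and also (\w+)!$'
found = satisfies(example_pattern, 'a, b and also c!', ['a', 'b', 'c'])
not_found = satisfies(example_pattern, 'a, b and c!', ['a', 'b', 'c'])
```

This only verifies a candidate; it does not help with finding one, which is the actual question.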
A good solution would be able to handle complex syntax that includes features such as non-capturing groups, lookaround assertions, etc.
Attempt at a solution:
Let's assume that, given the list of parts and the regex, we have enough information to reconstruct the original string (i.e. there is a unique string that satisfies the conditions put forth). For this reason, let's restrict ourselves to regex patterns of the form '^.*$', that is, patterns anchored at both the start and the end.
In the simplest case, where there are no nested parentheses and all parentheses form capturing groups (as opposed to escaped parens \( and \), non-capturing groups, lookarounds, etc.), we could just do something simple like
def naive_reconstruct(pattern, parts):
    # Replace capturing groups with elements from `parts`
    for p in parts:
        pattern = re.sub(r'\([^\(\)]*\)', p, pattern, 1)
    return pattern
However, this fails for anything more complex, such as the constructs mentioned above. Let's consider the pattern '^(\w(?!\d))_(\d{4})_(?:xxx(\d{2}))?$', which will match 'a_1234_' and 'a_1234_xxx01' but not 'a2_1234_xxx01'. (For our purposes, we can assume all values in parts are acceptable matches for the corresponding capturing group.)
If we had parts = ['b', '5678', '02'] we would expect to get out 'b_5678_xxx02', whereas if parts = ['b', '5678', None] the output should be 'b_5678_'. In both cases the simple solution fails.
pattern = r'^(\w(?!\d))_(\d{4})_(?:xxx(\d{2}))?$'
naive_reconstruct(pattern, ['b', '5678', '02'])
# Returns '^5678_02_(?:xxx(\\d{2}))?$'
naive_reconstruct(pattern, ['b', '5678', None])
# Raises TypeError
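As a sanity check on the example pattern itself, the claims about which strings it matches, and which groups it captures, can be confirmed directly. The variable names here are only for illustration.

```python
import re

regex = re.compile(r'^(\w(?!\d))_(\d{4})_(?:xxx(\d{2}))?$')

# The optional trailing part is absent, so the third group is None
short_groups = regex.match('a_1234_').groups()

# The optional trailing part is present, so all three groups are filled
long_groups = regex.match('a_1234_xxx01').groups()

# The first character is followed by a digit, so the lookahead rejects it
rejected = regex.match('a2_1234_xxx01')
```

The `None` in the first result is what the `parts = ['b', '5678', None]` case corresponds to: a group that did not participate in the match.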
One way to get around the artifacts left by non-capturing groups is by first removing them from the string. I managed to accomplish this for simple cases where there is no more than one set of nested parentheses inside of any non-capturing group or lookaround assertion:
def reconstruct(pattern, parts):
    # First, eliminate non-capturing groups, etc., from the string
    noncapturing = r'\(\?[:=!]([^\(\)]*?(?:\(.*?\))*?[^\(\)]*?)\)'
    reconstructed = re.compile(noncapturing).sub(r'\g<1>', pattern)
    # Now replace capturing groups with elements from `parts`
    for p in parts:
        reconstructed = re.sub(r'\(.*?\)', p, reconstructed, 1)
    # Remove special characters, replace escaped characters
    reconstructed = reconstructed.replace('?', '')
    reconstructed = reconstructed.replace(r'\.', '.')
    # Strip string start & end anchors
    reconstructed = reconstructed.strip('^$')
    return reconstructed
This solution seems to work for simple cases but fails to generalize to patterns with deeper nesting of parentheses. Additionally, it depends on post hoc removal of residual special characters that were not substituted out earlier, which feels somewhat sloppy (although it is not a deal-breaker).
Obviously the regex parser can detect which parts of a string correspond to capturing groups; is there some way to access or otherwise emulate this functionality in a fully generalizable way?
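One direction that might work is to walk the parse tree that CPython's own regex parser produces. The sketch below is only a rough illustration and rests on several assumptions: it relies on the private, undocumented modules `re._parser` / `re._constants` (Python 3.11+) or `sre_parse` / `sre_constants` (earlier versions), it assumes the Python 3.6+ layout of `SUBPATTERN` nodes, and it assumes that any repeated part containing a capturing group occurs exactly once. The names `_render` and `reconstruct_from_tree` are hypothetical.

```python
import re

# Private parser modules; their location changed in Python 3.11.
# Neither location is a documented, stable API.
try:
    from re import _parser as sre_parse, _constants as C
except ImportError:  # Python 3.10 and earlier
    import sre_parse
    import sre_constants as C


def _render(seq, parts):
    """Return (text, has_group) for one parsed sequence.

    `text` is None when a capturing group inside it did not
    participate in the match (its entry in `parts` is None).
    """
    pieces, has_group = [], False
    for op, av in seq:
        if op == C.LITERAL:
            text, inner_has = chr(av), False
        elif op in (C.AT, C.ASSERT, C.ASSERT_NOT):
            continue  # anchors and lookarounds consume no text
        elif op == C.SUBPATTERN:
            group, _add_flags, _del_flags, inner = av
            if group is None:  # non-capturing group: descend into it
                text, inner_has = _render(inner, parts)
            else:  # capturing group: substitute the known value
                text, inner_has = parts[group - 1], True
        elif op in (C.MAX_REPEAT, C.MIN_REPEAT):
            lo, hi, inner = av
            text, inner_has = _render(inner, parts)
            if not inner_has:
                if lo != hi:
                    raise ValueError("repeat without a group is ambiguous")
                text *= lo  # fixed count such as x{3}
            elif text is None and lo == 0:
                text = ""  # optional part that did not participate
            # otherwise: assume the repeated part occurs exactly once
        else:
            raise ValueError(f"unsupported construct: {op!r}")
        has_group = has_group or inner_has
        pieces.append(text)
    if any(piece is None for piece in pieces):
        return None, has_group
    return "".join(pieces), has_group


def reconstruct_from_tree(pattern, parts):
    text, _ = _render(sre_parse.parse(pattern), parts)
    if text is None:
        raise ValueError("a required group has no value")
    # Confirm the candidate with the real regex engine
    match = re.fullmatch(pattern, text)
    if match is None or match.groups() != tuple(parts):
        raise ValueError("candidate does not reproduce the given parts")
    return text


pattern = r'^(\w(?!\d))_(\d{4})_(?:xxx(\d{2}))?$'
full = reconstruct_from_tree(pattern, ['b', '5678', '02'])
partial = reconstruct_from_tree(pattern, ['b', '5678', None])
```

Anything outside a capturing group that is not a literal, an anchor, a lookaround, or a simple repeat (character classes, alternation, backreferences) raises `ValueError` in this sketch, because the text it stands for cannot be determined from `parts` alone. Since the modules are private, the node layout may differ between Python versions.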
Related:
- Python Regex instantly replace groups addresses this problem but does not attempt to generalize to more complex regular expressions.