Python replace multiple strings while supporting backreferences

Question

There are some nice ways to handle simultaneous multi-string replacement in python. However, I am having trouble creating an efficient function that can do that while also supporting backreferences.

What i would like is to use a dictionary of expression / replacement terms, where the replacement terms may contain backreferences to something matched by the expression.

e.g. (note the \1)

repdict = {'&&':'and', '||':'or', '!([a-zA-Z_])':'not \1'}

I put the SO answer mentioned at the outset into the function below, which works fine for expression / replacement pairs that don't contain backreferences:

def replaceAll(repdict, text):
    repdict = dict((re.escape(k), v) for k, v in repdict.items())
    pattern = re.compile("|".join(repdict.keys()))
    return pattern.sub(lambda m: repdict[re.escape(m.group(0))], text)

However, it doesn't work for the key that does contain a backreference..

>>> replaceAll(repldict, "!newData.exists() || newData.val().length == 1")
'!newData.exists() or newData.val().length == 1'

If i do it manually, it works fine. e.g.:

pattern = re.compile("!([a-zA-Z_])")
pattern.sub(r'not \1', '!newData.exists()')

Works as expected:

'not newData.exists()'

In the fancy function, the escaping seems to be messing up the key that uses the backref, so it never matches anything.

I eventually came up with this. However, note that the problem of supporting backrefs in the input parameters is not solved, i'm just handling it manually in the replacer function:

def replaceAll(repPat, text):
    def replacer(obj):
        match = obj.group(0)
        # manually deal with exclamation mark match..
        if match[:1] == "!": return 'not ' + match[1:]
        # here we naively escape the matched pattern into
        # the format of our dictionary key
        else: return repPat[naive_escaper(match)]

    pattern = re.compile("|".join(repPat.keys()))
    return pattern.sub(replacer, text)

def naive_escaper(string):
    if '=' in string: return string.replace('=', '\=')
    elif '|' in string: return string.replace('|', '\|')
    else: return string

# manually escaping \ and = works fine
repPat = {'!([a-zA-Z_])':'', '&&':'and', '\|\|':'or', '\=\=\=':'=='}
replaceAll(repPat, "(!this && !that) || !this && foo === bar")

Returns:

'(not this and not that) or not this'

So if anyone has an idea how to make a multi-string replacement function that supports backreferences and accepts the replacement terms as input, I'd appreciate your feedback very much.

Why not include any necessary regex escapes in the keys in `replDict`, so you don't need to apply `re.escape`? — jonrsharpe, Jul 16 '17 at 13:23
It's more than an escaping problem: `!([a-zA-Z])` will match `!n` that isn't a key of your replacement dict. You can only use a dict to replace fixed strings. — Casimir et Hippolyte, Jul 16 '17 at 15:30
@jonrsharpe Thanks, yes, that's what I ended up having to do in the final example. But it is pretty clunky, for example, once a match is made, i then have to convert the match into an escaped match (using naive_escaper) just so i can look up the replacement value in the dictionary (repPat). I suspect that maybe the 'obj' parameter being passed to the replacer function has that somewhere in it's innards already, so will look further into this after a few hours of sleep :-). — fzzylogic, Jul 16 '17 at 15:32
Probably the real answer is that a dictionary of regexes isn't the best way to write a transpiler. — jonrsharpe, Jul 16 '17 at 15:39

Aran-Fey · Accepted Answer · 2017-08-11T10:39:35.047

Update: See Angus Hollands' answer for a better alternative.

I couldn't think of an easier way to do it than to stick with the original idea of combining all dict keys into one massive regex.

However, there are some difficulties. Let's assume a repldict like this:

repldict = {r'(a)': r'\1a', r'(b)': r'\1b'}

If we combine these to a single regex, we get (a)|(b) - so now (b) is no longer group 1, which means its backreference won't work correctly.

Another problem is that we can't tell which replacement to use. If the regex matches the text b, how can we find out that \1b is the appropriate replacement? It's not possible; we don't have enough information.

The solution to these problems is to enclose every dict key in a named group like so:

(?P<group1>(a))|(?P<group2>(b))

Now we can easily identify the key that matched, and recalculate the backreferences to make them relative to this group. so that \1b refers to "the first group after group2".

Here's the implementation:

def replaceAll(repldict, text):
    # split the dict into two lists because we need the order to be reliable
    keys, repls = zip(*repldict.items())

    # generate a regex pattern from the keys, putting each key in a named group
    # so that we can find out which one of them matched.
    # groups are named "_<idx>" where <idx> is the index of the corresponding
    # replacement text in the list above
    pattern = '|'.join('(?P<_{}>{})'.format(i, k) for i, k in enumerate(keys))

    def repl(match):
        # find out which key matched. We know that exactly one of the keys has
        # matched, so it's the only named group with a value other than None.
        group_name = next(name for name, value in match.groupdict().items()
                          if value is not None)
        group_index = int(group_name[1:])

        # now that we know which group matched, we can retrieve the
        # corresponding replacement text
        repl_text = repls[group_index]

        # now we'll manually search for backreferences in the
        # replacement text and substitute them
        def repl_backreference(m):
            reference_index = int(m.group(1))

            # return the corresponding group's value from the original match
            # +1 because regex starts counting at 1
            return match.group(group_index + reference_index + 1)  

        return re.sub(r'\\(\d+)', repl_backreference, repl_text)

    return re.sub(pattern, repl, text)

Tests:

repldict = {'&&':'and', r'\|\|':'or', r'!([a-zA-Z_])':r'not \1'}
print( replaceAll(repldict, "!newData.exists() || newData.val().length == 1") )

repldict = {'!([a-zA-Z_])':r'not \1', '&&':'and', r'\|\|':'or', r'\=\=\=':'=='}
print( replaceAll(repldict, "(!this && !that) || !this && foo === bar") )

# output: not newData.exists() or newData.val().length == 1
#         (not this and not that) or not this and foo == bar

Caveats:

Only numerical backreferences are supported; no named references.
Silently accepts invalid backreferences like {r'(a)': r'\2'}. (These will sometimes throw an error, but not always.)

Angus Hollands · Answer 2 · 2017-08-11T11:32:24.417

Similar solution to Rawing, only precomputing the expensive stuff ahead of time by modifying the group indices in backreferences. Also, using unnamed groups.

Here we silently wrap each case in a capture group, and then update any replacements with backreferences to correctly identify the appropriate subgroup by absolute position. Note, that when using a replacer function, backreferences do not work by default (you need to call match.expand).

import re
from collections import OrderedDict
from functools import partial

pattern_to_replacement = {'&&': 'and', '!([a-zA-Z_]+)': r'not \1'}


def build_replacer(cases):
    ordered_cases = OrderedDict(cases.items())
    replacements = {}

    leading_groups = 0
    for pattern, replacement in ordered_cases.items():
        leading_groups += 1

        # leading_groups is now the absolute position of the root group (back-references should be relative to this)
        group_index = leading_groups
        replacement = absolute_backreference(replacement, group_index)
        replacements[group_index] = replacement

        # This pattern contains N subgroups (determine by compiling pattern)
        subgroups = re.compile(pattern).groups
        leading_groups += subgroups

    catch_all = "|".join("({})".format(p) for p in ordered_cases)
    pattern = re.compile(catch_all)

    def replacer(match):
        replacement_pattern = replacements[match.lastindex]
        return match.expand(replacement_pattern)

    return partial(pattern.sub, replacer)


def absolute_backreference(text, n):
    ref_pat = re.compile(r"\\([0-99])")

    def replacer(match):
        return "\\{}".format(int(match.group(1)) + n)

    return ref_pat.sub(replacer, text)


replacer = build_replacer(pattern_to_replacement)
print(replacer("!this.exists()"))

Nice solution, better than my own (this supports named groups and is more efficient). +1 and I've added a link to this in my answer. — Aran-Fey, Aug 11 '17 at 10:43
Thank you for the feedback. Nice to try out different solutions :) — Angus Hollands, Aug 22 '17 at 14:16

williezh · Answer 3 · 2017-07-16T14:04:01.997

0

Simple is better than complex, code as below is more readable(The reason why you code not work as expected is that ([a-zA-Z_]) should not be in re.escape):

repdict = {
    r'\s*' + re.escape('&&')) + r'\s*': ' and ',
    r'\s*' + re.escape('||') + r'\s*': ' or ',
    re.escape('!') + r'([a-zA-Z_])': r'not \1',
}
def replaceAll(repdict, text):
    for k, v in repdict.items():
        text = re.sub(k, v, text)
    return text

edited Jul 16 '17 at 14:04

answered Jul 16 '17 at 13:55

williezh

917
5
8

1

That's not the answer of what fzzylogic asked. – Alfran Jul 16 '17 at 13:58
@williezh Thanks for your response, i appreciate it. Yes, you're right that re.escape breaks the regex, so escaping only the non regex keys is one approach. But if possible i'd love to devise something that 'knows' which things to escape and which not, so i can forget about it and just call it whenever needed. A for loop is certainly more readable, but i sort of got hooked on Andrew Clarke's approach of rolling up all the selector's into one regex and only doing a single .sub call. – fzzylogic Jul 16 '17 at 15:26

Python replace multiple strings while supporting backreferences

3 Answers3

Linked