Replacing repeated captures

Question

This is sort of a follow-up to the "Python regex - Replace single quotes and brackets" thread.

The task:

Sample input strings:

RSQ(name['BAKD DK'], name['A DKJ'])
SMT(name['BAKD DK'], name['A DKJ'], name['S QRT'])

Desired outputs:

XYZ(BAKD DK, A DKJ)
XYZ(BAKD DK, A DKJ, S QRT)

The number of name['something']-like items is variable.

The current solution:

Currently, I'm doing it through two separate re.sub() calls:

>>> import re
>>>
>>> s = "RSQ(name['BAKD DK'], name['A DKJ'])"
>>> s1 = re.sub(r"^(\w+)", "XYZ", s)
>>> re.sub(r"name\['(.*?)'\]", r"\1", s1)
'XYZ(BAKD DK, A DKJ)'

The question:

Would it be possible to combine these two re.sub() calls into a single one?

In other words, I want to replace something at the beginning of the string and then multiple similar things after, all of that in one go.

I've looked into the regex module. Its ability to capture repeated patterns looks very promising; I tried using regex.subf() but failed to make it work.

@PedroLobito yeah, there could be any number of `name['...']` items in the string. That's what makes it difficult: I am not sure how to reference multiple captured groups without knowing how many I've got. Hope the task is clear. — alecxe, May 23 '16 at 01:11
An interesting way to do this would be using a function (since `re.sub()` can take a function instead of a string as the "replacement") but I'm not sure if that would be any cleaner for what you want... — hichris123, May 23 '16 at 01:19
I found a solution, though it works for the PCRE engine only. Check **[here](https://regex101.com/r/gX2mP2/1)** — rock321987, May 24 '16 at 15:50
@rock321987 oh, great job! Looks like the magic of the `\G` anchor which we don't have in Python `re` or `regex`, right? Thanks. — alecxe, May 24 '16 at 16:05
*Do you think it would be possible to solve it without a replacement function and (somehow) referencing the captured groups in the replacement string?* - Not possible with either `regex` or `re`, but it is possible with Boost or PCRE2 regex. There, you have access to a conditional replacement pattern, where you can still write an `if-then` construction/logic. — Wiktor Stribiżew, May 25 '16 at 12:51
The input sample vs. the desired output smacks of symmetry and variable length. There is a sense of nesting as well. This is not a recipe for regular expressions with replacement. Problem 1: Even if you use a PCRE engine to handle balanced text, replacement is a nightmare: you'd have to construct a new string as you go, and it involves recursion on a core. Problem 2: Even without nesting, there is a variable number of the same construct in the body. Conclusion: .NET is the only viable engine that can match/replace all of these in a single pass. For all other _lame_ engines, it takes 2 passes. — , May 25 '16 at 16:21

Casimir et Hippolyte · Accepted Answer · 2016-05-23T02:16:55.553

13

You can indeed use the regex module and repeated captures. The main benefit is that you can also check the structure of the matched string:

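import regex

s = "RSQ(name['BAKD DK'], name['A DKJ'])"

# Verbose mode ignores unescaped whitespace in the pattern, hence "[ ]" for a literal space.
regO = regex.compile(r'''
    \w+ \( (?: name\['([^']*)'] (?: ,[ ] | (?=\)) ) )* \)
    ''', regex.VERBOSE)

# m.captures(1) returns every value captured by group 1, not only the last one.
regO.sub(lambda m: 'XYZ(' + ', '.join(m.captures(1)) + ')', s)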

(Note that you can replace "name" with \w+ or anything you want without problems.)
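
For comparison, here is a rough sketch of the same "check the structure, then rebuild" idea using only the standard re module. It does not use repeated captures: it validates the whole call with fullmatch() and then collects the items with findall(). The names ITEM, CALL and rewrite_call are made up for this illustration, and returning malformed input unchanged is an assumption about the desired behaviour.

```python
import re

# One name['...'] item; group 1 is the text between the quotes.
ITEM = r"name\['([^']*)'\]"

# A whole call: WORD( item, item, ... ), each item followed by ", " or by ")".
CALL = re.compile(r"\w+\((?:%s(?:, |(?=\))))*\)" % ITEM)


def rewrite_call(text, new_name="XYZ"):
    """Return new_name(a, b, ...) if text has the expected shape, else text unchanged."""
    if CALL.fullmatch(text) is None:
        return text  # assumption: leave unexpected input alone
    return "{}({})".format(new_name, ", ".join(re.findall(ITEM, text)))


print(rewrite_call("RSQ(name['BAKD DK'], name['A DKJ'])"))
print(rewrite_call("SMT(name['BAKD DK'], name['A DKJ'], name['S QRT'])"))
```

This is still two regex passes under the hood, so it does not answer the "single re.sub()" question; it only shows that the structural check is possible without the third-party module.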

edited May 23 '16 at 02:16

answered May 23 '16 at 01:33

Thanks so much for providing a `regex`-module specific approach! Do you think it would be possible to solve it without a replacement function and (somehow) referencing the captured groups in the replacement string? – alecxe May 23 '16 at 03:43
@alecxe: No, you can't because there is no way to build a replacement string or a formatted string for an undetermined number of repeated captures. – Casimir et Hippolyte May 23 '16 at 11:23

Brendan Abel · Answer 2 · 2016-06-01T01:04:09.177

9

You could do this, though I don't think it's very readable, and it could get unruly if you start adding more patterns to replace. It takes advantage of the fact that the replacement argument of re.sub() can be a function instead of a string.

import re
s = "RSQ(name['BAKD DK'], name['A DKJ'])"
# Group 1 matches only the leading function name; group 2 holds each quoted item.
re.sub(r"^(\w+)|name\['(.*?)'\]", lambda m: 'XYZ' if m.group(1) else m.group(2), s)
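
If more alternatives get added, one way to keep the callback manageable is to name the groups and branch on m.lastgroup. This is only a sketch of that idea; the group names func and item and the helper names PATTERN, replace and rewrite are hypothetical.

```python
import re

# Same two alternatives as above, but with named groups.
PATTERN = re.compile(r"^(?P<func>\w+)|name\['(?P<item>.*?)'\]")


def replace(match):
    # lastgroup is the name of the group that took part in this match.
    if match.lastgroup == "func":
        return "XYZ"
    return match.group("item")


def rewrite(text):
    return PATTERN.sub(replace, text)


print(rewrite("RSQ(name['BAKD DK'], name['A DKJ'])"))
```

Each new pattern then needs one more named alternative and one more branch, which reads a little better than chained conditionals on numbered groups.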

edited Jun 01 '16 at 01:04

answered May 23 '16 at 01:21

score 9 · Answer 3 · edited May 23 '16 at 12:57

Please do not do this in any code I have to maintain.

You are trying to parse syntactically valid Python. Use ast for that. It's more readable, easier to extend to new syntax, and won't fall apart on some weird corner case.

Working sample:

from ast import parse

l = [
    "RSQ(name['BAKD DK'], name['A DKJ'])",
    "SMT(name['BAKD DK'], name['A DKJ'], name['S QRT'])"
]

for item in l:
    tree = parse(item)
    # Python 3.9+: arg.slice is the Constant node itself (older versions need arg.slice.value.s)
    args = [arg.slice.value for arg in tree.body[0].value.args]

    output = "XYZ({})".format(", ".join(args))
    print(output)

Prints:

XYZ(BAKD DK, A DKJ)
XYZ(BAKD DK, A DKJ, S QRT)
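
A slightly more defensive variant of the same idea, as a sketch assuming Python 3.9 or later (where a subscript's slice is the constant node itself). The function name rewrite_call and the choice to raise ValueError on unexpected input are assumptions made for illustration.

```python
import ast


def rewrite_call(source, new_name="XYZ"):
    """Rebuild FUNC(name['a'], ...) as XYZ(a, ...) via the syntax tree."""
    # mode="eval" parses a single expression; .body is that expression's node.
    call = ast.parse(source, mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a single function call")

    items = []
    for arg in call.args:
        # Each argument must look like something['string literal'].
        if not (isinstance(arg, ast.Subscript)
                and isinstance(arg.slice, ast.Constant)
                and isinstance(arg.slice.value, str)):
            raise ValueError("unexpected argument: " + ast.dump(arg))
        items.append(arg.slice.value)

    return "{}({})".format(new_name, ", ".join(items))


print(rewrite_call("SMT(name['BAKD DK'], name['A DKJ'], name['S QRT'])"))
```

Checking the node types up front means malformed input fails with a clear error instead of an AttributeError somewhere inside a list comprehension.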

Out of the box thinking, interesting option, thanks! – alecxe May 23 '16 at 12:57

score 3 · Answer 4 · answered May 23 '16 at 11:46

You can use re.findall() and simple string formatting:

>>> import re
>>> s = "SMT(name['BAKD DK'], name['A DKJ'], name['S QRT'])"
>>>
>>> 'XYZ({})'.format(', '.join(re.findall(r"'([^']+)'", s)))
'XYZ(BAKD DK, A DKJ, S QRT)'

Mazdak

Except you're basically ignoring the first regex check/match, assuming it exists, and manually doing the `XYZ` replace. – Brendan Abel May 23 '16 at 16:52
@BrendanAbel There is no need for that, since the OP wants to replace all of those words with `XYZ`. If a word other than `XYZ` is needed, it can be substituted in `'XYZ({})'`. – Mazdak May 23 '16 at 17:56
