Is there a standard way of dealing with regular expressions when a prefix is present for multiple instances?

Question

For instance:

There are multiple instances in a column of my dataframe (that happens to be a nested JSON) in which there is something like 'prefix("stuff i want"),'.

How do I replace 'prefix("stuff i want"),' with "stuff i want"?

As stated, the variables are strings. I updated the question to reflect this more clearly. — seeaemearohin, Mar 04 '20 at 20:44
@JohnnyMopp Yes. There is a comma included as well. The desired result is x = '"hello",'. — seeaemearohin, Mar 04 '20 at 20:47
@Tomerikoo There is actually a comma always at the end. I updated the question to include that. — seeaemearohin, Mar 04 '20 at 20:48
Can there be parens in there, such as `'prefix("hello (there)")'`? — tdelaney, Mar 04 '20 at 20:49
Duplicate: https://stackoverflow.com/questions/4894069/regular-expression-to-return-text-between-parenthesis — Tomerikoo, Mar 04 '20 at 21:07
@seeaemearohin, check out the bottom of my answer for an answer to "how do I replace... with stuff I want" — Todd, Mar 04 '20 at 21:22
@seeaemearohin also check out my answer for a more elaborate way to replace matching text: https://stackoverflow.com/a/60514155/7915759 — Todd, Mar 04 '20 at 21:24

Todd · Accepted Answer · 2020-03-05T07:07:29.070

using grouping

The question was updated to be more specific after I wrote the section immediately below. For a more specific answer, scroll down to "referencing groups", where I show how to do replacements.

Using groups to single out matching text in a target string:

>>> import re
>>> re.findall(r"prefix\((.*?)\)", 'prefix("hello")')[0]
'"hello"'

or more generally to capture anything in parentheses:

>>> re.findall(r"\((.*?)\)", 'prefix("hello")')[0]
'"hello"'

demo

>>> target = """
... x = 'prefix("hello"),'
... y = 'prefix("hi"),'
... z = 'prefix("hey"),'
... """
>>> 
>>> re.findall(r"prefix\((.*?)\)", target)
['"hello"', '"hi"', '"hey"']
>>>

I believe the feature of regular expressions that would benefit you would be "groups".

When you put unescaped parentheses (...) around part of your regex, you create a capturing group. This is a powerful way to extract bits and pieces from the target text.

reference https://docs.python.org/3/library/re.html

(...) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].
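To make the quoted documentation concrete, here is a minimal sketch using the sample string from the question (the variable names are mine). It shows the difference between the whole match and the captured group, and that a character class is an equivalent way to match literal parentheses:

```python
import re

text = 'prefix("hello"),'

# Escaped parentheses match a literal "(" and ")";
# the unescaped pair in between forms capturing group 1.
m = re.search(r"prefix\((.*?)\)", text)
whole = m.group(0)   # the entire match, prefix and parentheses included
inner = m.group(1)   # only what the group captured

# Same pattern, with the literal parentheses written as character classes.
m2 = re.search(r"prefix[(](.*?)[)]", text)

print(whole)                 # prefix("hello")
print(inner)                 # "hello"
print(m2.group(1) == inner)  # True
```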

referencing groups

Also, you can reference your matching groups using a backslash and a number: \1 is the first group. This is useful for operations like re.sub():

>>> print(re.sub(r"prefix\((.*?)\)", r"prefix(   && \1 &&   )", target))

x = 'prefix(   && "hello" &&   ),'
y = 'prefix(   && "hi" &&   ),'
z = 'prefix(   && "hey" &&   ),'

>>>

You can do things like above where I put && around the group to make it stand out.

More along the lines of answering the exact question, we can extract the text we need and dispense with 'prefix' and other text we don't need:

>>> print(re.sub(r"'prefix\((.*?)\),'", r"\1", target))

x = "hello"
y = "hi"
z = "hey"

>>>

(assuming you want to keep the double quotes in the output, and lose the tick quotes and commas)
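If the trailing comma should be kept instead (the asker's comment gives '"hello",' as the desired result), the pattern can simply leave the tick quotes and comma out of the match. The following is a sketch under that assumption; `strip_prefix` and the sample list are hypothetical names standing in for the values of one dataframe column:

```python
import re

# Compile once; group 1 captures whatever sits between "prefix(" and the next ")".
PREFIX_CALL = re.compile(r"prefix\((.*?)\)")

def strip_prefix(value):
    """Replace every prefix(...) in value with the text inside the parentheses."""
    return PREFIX_CALL.sub(r"\1", value)

# Stand-in for the values of one column (made-up sample data).
column = ['prefix("hello"),', 'prefix("hi"),', 'a: prefix("x"), b: prefix("y"),']
cleaned = [strip_prefix(v) for v in column]

print(cleaned)
# ['"hello",', '"hi",', 'a: "x", b: "y",']
```

Because `.*?` is lazy, several instances in the same string are each replaced separately. With pandas, the same pattern and replacement should also work through `Series.str.replace(pattern, repl, regex=True)`, though I have not shown that here.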

For more elaborate match replacement operations, you can check out my other post here: https://stackoverflow.com/a/60514155/7915759. In that example I show how to match text from a list of words, and how to replace matches with text from a dictionary.
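As a rough illustration of that idea (this is not the code from the linked post, and the lookup table is made up), re.sub() also accepts a function as the replacement. The function receives each match object and returns the text to substitute:

```python
import re

# Hypothetical lookup table; keys and values are invented for illustration.
replacements = {"hello": "bonjour", "hi": "salut"}

def swap(match):
    # Group 1 is the text between the double quotes.
    word = match.group(1)
    # Fall back to the original word when it is not in the table.
    return '"' + replacements.get(word, word) + '"'

pattern = re.compile(r'prefix\("(.*?)"\)')
text = 'prefix("hello"), prefix("hi"), prefix("hey"),'
result = pattern.sub(swap, text)

print(result)  # "bonjour", "salut", "hey",
```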

positive lookbehind/ahead assertion

In most cases lookahead/lookbehind assertions aren't needed, but they are an option. They offer a way to specify preceding/subsequent text without grouping. As in other posts to the question, the syntax is (?<=...) for lookbehind, and (?=...) for lookahead.

These subexpressions can complicate the overall expression and make it more difficult to read. Maybe not always, but they should be used sparingly in my opinion.

When there is a way to do the job without them, the deciding question is: do they make your code easier to read? Whichever way is easier to grok for you and other developers is the one you should go with.

Here's an example using lookahead/behind assertions that does the same thing as the earlier && example. You can decide which is easier to comprehend:

>>> print(re.sub(r"(?<=prefix\()(.*?)(?=\))", r"   && \1 &&   ", target))

x = 'prefix(   && "hello" &&   ),'
y = 'prefix(   && "hi" &&   ),'
z = 'prefix(   && "hey" &&   ),'

In the particular case of the question, these assertions wouldn't work. Lookarounds are zero-width: the prefix( and ) text they check is not part of the match, so re.sub() cannot remove it from the output.
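A small sketch of that difference, using the sample string from the question (the replacement text "X" is just a placeholder I chose):

```python
import re

text = 'prefix("hello"),'

# Lookarounds assert context but consume nothing,
# so the match is only the quoted value.
look = re.search(r"(?<=prefix\().*?(?=\))", text)
# With a group, the whole prefix(...) call is part of the match.
group = re.search(r"prefix\((.*?)\)", text)

print(look.group(0))   # "hello"
print(group.group(0))  # prefix("hello")

# re.sub() replaces only the matched span, so the prefix
# survives the lookaround version but not the group version.
kept = re.sub(r"(?<=prefix\().*?(?=\))", "X", text)
gone = re.sub(r"prefix\((.*?)\)", r"\1", text)

print(kept)  # prefix(X),
print(gone)  # "hello",
```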

This is not what I would consider a general response. hello could be anything. — seeaemearohin, Mar 04 '20 at 20:49
@seeaemearohin you are probably new to regex... where it says `hello` is the actual string to be searched... the pattern is before that `"\((.*?)\)"` — Tomerikoo, Mar 04 '20 at 20:50

Superluminal · Answer 2 · 2020-03-04T20:53:27.497


Make use of lookarounds. The regex for this would be:

import re
# Lazy ".*?" so each quoted value is matched separately
regex = r'(?<=prefix\()".*?"(?=\))'
re.findall(regex, your_string)

Demo https://regex101.com/r/0ZEXnN/2/
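When several instances share one line, the choice between a greedy `.*` and a lazy `.*?` between the quotes matters. A minimal sketch on a made-up input with two instances:

```python
import re

# Hypothetical input with two instances on the same line.
text = 'prefix("a"), prefix("b"),'

# Greedy: runs from the first opening quote to the last quote followed by ")".
greedy = re.findall(r'(?<=prefix\()".*"(?=\))', text)
# Lazy: stops at the first closing quote that is followed by ")".
lazy = re.findall(r'(?<=prefix\()".*?"(?=\))', text)

print(greedy)  # ['"a"), prefix("b"']
print(lazy)    # ['"a"', '"b"']
```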

edited Mar 04 '20 at 20:53

answered Mar 04 '20 at 20:49


This is probably what the OP wants. Can you please put that within r"" quotes? – Todd Mar 04 '20 at 20:53
@Predicate. Todd is right. This is closer to what I want. Is it possible to do something like re.replace(regex, "hello") where "hello" is, in general, anything? – seeaemearohin Mar 04 '20 at 20:55
@seeaemearohin You mean `re.sub` – Tomerikoo Mar 04 '20 at 20:56
@Tomerikoo Yes, re.sub. How would that work? I think that would be the solution. – seeaemearohin Mar 04 '20 at 20:57
This can also be achieved without lookahead or lookback just using regular parenthesis to group matching text @seeaemearohin – Todd Mar 04 '20 at 20:59
@Tomerikoo Something like re.sub(regex, .*) where .* is whatever is between the parentheses - in this case being "hello". – seeaemearohin Mar 04 '20 at 20:59
@seeaemearohin And dont forget to upvote and accept the answer that helped you. – Superluminal Mar 04 '20 at 21:01

score 0 · Answer 3 · answered Mar 04 '20 at 20:52


This regex works. It skips lazily to the first "(" and captures everything up to the next ")":

x = 'prefix("hello")'
y = 'prefix("hi")'
z = 'prefix("hey")'

import re
for test in (x,y,z):
    print(test, re.match(r".*?\(([^\)]*)", test).group(1))

result

prefix("hello") "hello"
prefix("hi") "hi"
prefix("hey") "hey"
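One caveat, raised in the comments on the question: if the quoted text can itself contain parentheses, as in 'prefix("hello (there)")', any pattern that stops at the first ")" truncates the value. A possible workaround, assuming the value is always a double-quoted string with no escaped quotes inside, is to match the quoted string as a unit:

```python
import re

text = 'prefix("hello (there)"),'

# Stops at the first ")" and so cuts the value short.
naive = re.findall(r"prefix\((.*?)\)", text)

# Match the double-quoted string as a unit: an opening quote,
# any run of non-quote characters, then a closing quote.
quoted = re.findall(r'prefix\(("[^"]*")\)', text)

print(naive)   # ['"hello (there']
print(quoted)  # ['"hello (there)"']
```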

tdelaney
