For instance:
There are multiple instances in a column of my dataframe (that happens to be a nested JSON) in which there is something like 'prefix("stuff i want"),'.
How do I replace 'prefix("stuff i want"),' with "stuff i want"?
using grouping
The question was updated to be more specific after I put together what's immediately below - for a more specific answer, you can scroll down to referencing groups where I have examples on how to do replacements.
Using groups to single out matching text in a target string:
>>> import re
>>> re.findall(r"prefix\((.*?)\)", 'prefix("hello")')[0]
'"hello"'
or, more generally, to capture anything in parentheses:
>>> re.findall(r"\((.*?)\)", 'prefix("hello")')[0]
'"hello"'
demo
>>> target = """
... x = 'prefix("hello"),'
... y = 'prefix("hi"),'
... z = 'prefix("hey"),'
... """
>>>
>>> re.findall(r"prefix\((.*?)\)", target)
['"hello"', '"hi"', '"hey"']
>>>
I believe the feature of regular expressions that would benefit you is "groups".
When you put unescaped parentheses (.....) around text in your regex, you create a matching group. This is a powerful way to extract bits and pieces from the target text.
reference https://docs.python.org/3/library/re.html
(...) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].
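To make the documentation excerpt concrete, here is a minimal sketch of retrieving a group after a match using re.search() and the match object's group() method. The sample string and the variable names are made up for illustration.

```python
import re

# Sample text (illustrative, not taken from the question's data).
text = 'prefix("hello"),'

# \( and \) match literal parentheses; the unescaped ( ) form group 1.
match = re.search(r"prefix\((.*?)\)", text)

whole = match.group(0)  # the entire match, including prefix( and )
inner = match.group(1)  # only what the group captured

print(whole)
print(inner)
```

group(0) is always the full match; numbered groups start at 1, which is the same number that \1 refers to.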
referencing groups
Also, you can reference your matching groups using a backslash and a number: \1 would be the first match group. This is useful for operations like re.sub():
>>> print(re.sub(r"prefix\((.*?)\)", r"prefix( && \1 && )", target))
x = 'prefix( && "hello" && ),'
y = 'prefix( && "hi" && ),'
z = 'prefix( && "hey" && ),'
>>>
You can do things like the above, where I put && around the group to make it stand out.
More along the lines of answering the exact question, we can extract the text we need and dispense with 'prefix' and other text we don't need:
>>> print(re.sub(r"'prefix\((.*?)\),'", r"\1", target))
x = "hello"
y = "hi"
z = "hey"
>>>
(assuming you want to keep the double quotes in the output, and lose the tick quotes and commas)
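If the same replacement has to run over many values, such as the cells of a column, one possible approach is to compile the pattern once and wrap the substitution in a small function. This is only a sketch: the helper name unwrap_prefix and the sample values are hypothetical, the tick quotes from the demo are left out of the pattern, and the trailing comma is treated as optional.

```python
import re

# Compiled once so the same pattern can be reused for every value.
# Group 1 captures everything between prefix( and the next ).
PATTERN = re.compile(r"prefix\((.*?)\),?")


def unwrap_prefix(text):
    """Replace each prefix(...) and an optional trailing comma with its contents."""
    return PATTERN.sub(r"\1", text)


# Hypothetical cell values standing in for the dataframe column.
cells = ['prefix("hello"),', 'prefix("hi"),', 'no prefix here']
cleaned = [unwrap_prefix(cell) for cell in cells]
print(cleaned)
```

Values without a match pass through unchanged. If the values live in a pandas column, Series.str.replace accepts a pattern and replacement of this kind when called with regex=True, though that is not exercised here.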
For more elaborate match replacement operations, you can check out my other post here: https://stackoverflow.com/a/60514155/7915759. In that example I show how to match text from a list of words, and how to replace matches with text from a dictionary.
positive lookbehind/ahead assertion
In most cases lookahead/lookbehind isn't needed, but they are also an option. They offer a way to specify preceding/subsequent text without grouping. As in other posts to the question, the syntax is (?<=...) for lookbehind, and (?=...) for lookahead.
These subexpressions can complicate the overall expression and make it more difficult to read. Maybe not always, but they should be used sparingly in my opinion.
When there's a way to do it without them, the question of whether to use them comes down to: does it make your code easier to read? Whichever way is easier to grok for you and other developers is the one you should go with.
Here's an example using lookahead/behind assertions that does the same thing as the previous example. Which is easier to comprehend - you can decide:
>>> print(re.sub(r"(?<=prefix\()(.*?)(?=\))", r" && \1 && ", target))
x = 'prefix( && "hello" && ),'
y = 'prefix( && "hi" && ),'
z = 'prefix( && "hey" && ),'
In the particular case of the question, using these assertions would leave the prefix... text out of the match, so you wouldn't be able to remove it from the re.sub() output. So in this case, they wouldn't work.
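A short sketch of why that is: lookarounds are zero-width, so the text they check is never part of the match, and re.sub() only replaces what was matched. The sample string below is made up to mirror the demo target.

```python
import re

text = "x = 'prefix(\"hello\"),'"

# With lookarounds, the match covers only the text between them.
look = re.search(r"(?<=prefix\()(.*?)(?=\))", text)
# With plain grouping, prefix( and ) are part of the match itself.
plain = re.search(r"prefix\((.*?)\)", text)

look_span = look.group(0)
plain_span = plain.group(0)

# re.sub() replaces only the matched span, so the lookaround version
# leaves prefix( ... ) in place while the grouping version removes it.
look_result = re.sub(r"(?<=prefix\()(.*?)(?=\))", r"\1", text)
plain_result = re.sub(r"prefix\((.*?)\)", r"\1", text)

print(look_span, "|", plain_span)
print(look_result)
print(plain_result)
```

Here the lookaround substitution returns the input unchanged, because the matched span and the replacement are the same text.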
Make use of lookarounds. The regex for this would be:
import re
regex = r'(?<=prefix\()".*"(?=\))'
re.findall(regex, your_string)
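One caveat worth checking with a greedy .* between the quotes: if two prefix(...) calls can appear on the same line, the greedy form runs from the first opening quote to the last closing one. The sketch below compares it with a lazy .*? variant, which is my own suggestion and not part of the answer above; the input line is made up.

```python
import re

# Made-up input with two calls on one line.
line = 'prefix("hi"), prefix("hey")'

# Greedy: .* grabs as much as possible before the final "
greedy = re.findall(r'(?<=prefix\()".*"(?=\))', line)

# Lazy: .*? stops at the first " that is followed by )
lazy = re.findall(r'(?<=prefix\()".*?"(?=\))', line)

print(greedy)
print(lazy)
```

When every call sits on its own line, as in the demo target, both forms give the same result, since . does not match a newline by default.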
This regex works:
import re

x = 'prefix("hello")'
y = 'prefix("hi")'
z = 'prefix("hey")'

# Skip lazily to the first "(", then capture everything up to the next ")".
for test in (x, y, z):
    print(test, re.match(r".*?\(([^\)]*)", test).group(1))
result
prefix("hello") "hello"
prefix("hi") "hi"
prefix("hey") "hey"