Using a regex stored as a variable in Python

Question

I read regexes and their replacements from a CSV into a dictionary and then apply them to a column in a DataFrame, looking for locations:

for regex, replacement in regex_replace.items():
    df["A"] = df["a"].str.replace(regex, replacement)

This works fine and successfully replaces the text. An example regex would be:

(?i)\b(maine)

However, I also want to capture the text that has been replaced from the regex match. I've tried this:

import re

def find_match(regex, x):
    j = re.findall(r'{0}'.format(regex), x)
    return ",".join(j)

df['matches'] = df['A'].apply(lambda x: find_match(regex,str(x)))

But that doesn't find any matches - I think it's because the backslash is escaped. If I declared the regex variable as a raw string in the code, then this would work:

regex = r'(?i)\b(maine)'

However, I can't do that as it's already stored in a variable. Is there a way to do this?
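
One way to check whether the backslash is really the problem is to compare a pattern loaded from CSV with the raw-string literal. The sketch below is illustrative only: it assumes the file stores the pattern exactly as typed, and it uses an in-memory `csv_text` and a made-up sample sentence in place of the real file and DataFrame.

```python
import csv
import io
import re

# Hypothetical stand-in for the CSV file: one "pattern,replacement" row.
csv_text = r"(?i)\b(maine),ME" + "\n"

rows = list(csv.reader(io.StringIO(csv_text)))
regex_from_csv, replacement = rows[0]

# The r'' prefix only affects how source code literals are parsed.
# A string read from a file already holds a single backslash character.
literal = r"(?i)\b(maine)"
same_string = regex_from_csv == literal

# Made-up sample text, not from the original data.
matches = re.findall(regex_from_csv, "Flights to Maine and MAINE")
print(same_string, matches)
```

If the comparison comes out false for a real file, printing `repr(regex_from_csv)` should show what the file actually contains.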

Related answers are: "regex re.search is not returning the match" and "Python Regex in Variable".

I don't see how the first version works correctly. First, you're missing a `]` after `df["a"`. But more importantly, you're assigning the result to a different column than the source. So each time through the loop it processes the original source column, discarding the replacements from the previous iterations. You need to assign back to the same column. — Barmar, Aug 31 '23 at 16:19
Please show an example of `regex_replace` and the dataframe. — Barmar, Aug 31 '23 at 16:24
Is the difference between `df["A"]` and `df["a"]` intentional? — Barmar, Sep 01 '23 at 14:40

score 0 · Answer 1 · answered Aug 31 '23 at 16:52

Do your regex values include actual backslashes? If not, I think there is a way to solve this by cheating a little.

    def find_match(regex, x):
        regex_raw = regex.replace("\\", "\\\\")
        j = re.findall(regex_raw, x)
        return ",".join(j)

By replacing each backslash with a double backslash you're converting the regex to its raw string representation. But the solution assumes that the only backslashes in your regex patterns are meant to be escape sequences for regex special characters. If you have actual backslashes in your regex values, then things become a little tricky, and you could implement some custom patterns for the ones you want treated as literal strings.
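
Whether the doubling helps depends on what the stored strings actually contain, so it is worth testing on one pattern first. The following is a small experiment, not part of the original answer; it uses the question's example pattern and a made-up sample string, and compares what `re.findall` returns before and after the replacement.

```python
import re

regex = r"(?i)\b(maine)"                 # example pattern from the question
doubled = regex.replace("\\", "\\\\")    # the substitution used in find_match

sample = "Portland, Maine"               # made-up sample text

before = re.findall(regex, sample)
after = re.findall(doubled, sample)

# In the doubled pattern, "\\b" means a literal backslash followed by "b",
# so it no longer acts as a word boundary.
print(repr(doubled), before, after)
```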

Yeah, most of the regexes have actual backslashes in them — Tomp, Sep 01 '23 at 11:24

score -2 · Accepted Answer · edited Sep 01 '23 at 11:34

One can use an f-string for that.

def find_match(regex, x):
    j = re.findall(rf'{regex}', x)
    return ",".join(j)

edited Sep 01 '23 at 11:34 by Cow
answered Aug 31 '23 at 18:02 by LetzerWille

This worked! thank you – Tomp Sep 01 '23 at 11:23
`rf'{regex}'` just evaluates to a string exactly equal to `regex`. – user2357112 Sep 01 '23 at 11:43
@Tomp If this worked then you didn't actually have a problem in the first place. – Barmar Sep 01 '23 at 14:42
`rf'{regex}'` is also the same as `r'{0}'.format(regex)` in the OP's code. The `r` doesn't do anything in either case, since there are no escape sequences in the format string (it doesn't apply after substitution of the variable). – Barmar Sep 01 '23 at 14:44
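
The point made in these comments can be verified with a short check. This is a sketch using the question's example pattern; the variable names and the sample text are illustrative.

```python
import re

regex = r"(?i)\b(maine)"  # example pattern from the question

via_fstring = rf"{regex}"
via_format = r"{0}".format(regex)

# Both expressions produce a string equal to the original pattern;
# the r prefix applies to the literal template, not to the substituted value.
print(via_fstring == regex, via_format == regex)

# Consequently they give the same matches (sample text is made up).
sample = "Maine lobster"
print(re.findall(regex, sample), re.findall(via_fstring, sample))
```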
