-1

I read regexes and their replacements from a CSV into a dictionary and then run them over a column in a DataFrame, looking for locations:

for regex, replacement in regex_replace.items():
    df["A"] = df["a"].str.replace(regex, replacement)

This works fine and successfully replaces the text. An example regex would be:

(?i)\b(maine)

However, I also want to capture the text that has been replaced from the regex match. I've tried this:

import re

def find_match(regex, x):
    j = re.findall(r'{0}'.format(regex), x)
    return ",".join(j)

df['matches'] = df['A'].apply(lambda x: find_match(regex,str(x)))

But that doesn't find any matches - I think it's because the backslash is escaped. If I declared the regex variable as a raw string in the code, then this would work:

regex = r'(?i)\b(maine)'

However, I can't do that as it's already stored in a variable. Is there a way to do this?

Related answers are: "regex re.search is not returning the match" and "Python Regex in Variable".

Tomp
  • I don't see how the first version works correctly. First, you're missing a `]` after `df["a"`. But more importantly, you're assigning the result to a different column than the source. So each time through the loop it processes the original source column, discarding the replacements from the previous iterations. You need to assign back to the same column. – Barmar Aug 31 '23 at 16:19
  • `r'{0}'.format(regex)` is just the same as `regex`. – Barmar Aug 31 '23 at 16:20
  • Please show an example of `regex_replace` and the dataframe. – Barmar Aug 31 '23 at 16:24
  • Apologies, edited the code to include the bracket – Tomp Sep 01 '23 at 11:23
  • Is the difference between `df["A"]` and `df["a"]` intentional? – Barmar Sep 01 '23 at 14:40
  • You still haven't provided any sample input data. – Barmar Sep 01 '23 at 14:41
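
The point in the first comment about the loop can be illustrated without pandas. This is a minimal sketch that uses a plain string and `re.sub` in place of the column and `str.replace`; the second pattern and the sample text are made up for illustration, since the question does not show its data.

```python
import re

# Hypothetical patterns and replacements; the question does not show its CSV.
regex_replace = {
    r"(?i)\b(maine)": "ME",
    r"(?i)\b(texas)": "TX",
}

source = "Flights from Maine to Texas"

# Reading from `source` on every pass keeps only the last replacement.
result = None
for regex, replacement in regex_replace.items():
    result = re.sub(regex, replacement, source)
print(result)  # Flights from Maine to TX

# Feeding each result into the next pass keeps all replacements.
text = source
for regex, replacement in regex_replace.items():
    text = re.sub(regex, replacement, text)
print(text)  # Flights from ME to TX
```

The same reasoning would apply to a DataFrame column: reading from and assigning back to the same column plays the role of `text` here.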

2 Answers

0

Do your regex values include actual backslashes? If not, I think there is a way to solve this by cheating a little.

    import re

    def find_match(regex, x):
        regex_raw = regex.replace("\\", "\\\\")
        j = re.findall(regex_raw, x)
        return ",".join(j)

By replacing each backslash with a double backslash you're converting the regex to its raw string representation. But the solution assumes that the only backslashes in your regex patterns are meant to be escape sequences for regex special characters. If you have actual backslashes in your regex values, then things become a little tricky, and you could implement some custom patterns for the ones you want treated as literal strings.
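
It may be worth checking what the doubling does before relying on it. In this minimal sketch (the sample strings are invented, and the pattern is the one from the question), the doubled pattern no longer contains a word boundary: inside a regex, two backslashes followed by `b` mean a literal backslash and the letter `b`, so the doubled pattern only matches text that itself contains a backslash.

```python
import re

regex = r"(?i)\b(maine)"               # pattern as it would arrive from a file
doubled = regex.replace("\\", "\\\\")  # the replacement used in find_match above

print(doubled)                         # (?i)\\b(maine)

plain_hits = re.findall(regex, "in Maine")
doubled_hits = re.findall(doubled, "in Maine")
literal_hits = re.findall(doubled, r"in \bMaine")

print(plain_hits)    # ['Maine']
print(doubled_hits)  # []
print(literal_hits)  # ['Maine']
```

If the undoubled pattern already matches your data, the extra replacement is probably not needed.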

b1n3t
-2

One can use an f-string for that.

import re

def find_match(regex, x):
    j = re.findall(rf'{regex}', x)
    return ",".join(j)
Cow
LetzerWille
  • This worked! thank you – Tomp Sep 01 '23 at 11:23
  • `rf'{regex}'` just evaluates to a string exactly equal to `regex`. – user2357112 Sep 01 '23 at 11:43
  • @Tomp If this worked then you didn't actually have a problem in the first place. – Barmar Sep 01 '23 at 14:42
  • `rf'{regex}'` is also the same as `r'{0}'.format(regex)` in the OP's code. The `r` doesn't do anything in either case, since there are no escape sequences in the format string (it doesn't apply after substitution of the variable). – Barmar Sep 01 '23 at 14:44
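
The equivalences described in these comments can be checked directly. The sketch below assumes a one-row CSV holding the pattern from the question and a made-up replacement (`ME`), read with the standard `csv` module from an in-memory string rather than a real file.

```python
import csv
import io
import re

# Hypothetical CSV row: pattern, replacement. The raw literal stands in for
# the bytes on disk, where the backslash is a single ordinary character.
csv_file = io.StringIO(r"(?i)\b(maine),ME")
regex, replacement = next(csv.reader(csv_file))

# Formatting a string into an otherwise empty template returns an equal string.
assert rf"{regex}" == regex
assert r"{0}".format(regex) == regex

# The value read from the file equals the raw-string literal from the question.
assert regex == r"(?i)\b(maine)"

def find_match(regex, x):
    # Join every captured match into one comma-separated string.
    return ",".join(re.findall(regex, x))

found = find_match(regex, "Portland, Maine and MAINE again")
print(found)  # Maine,MAINE
```

Note that `",".join(...)` only works here because the pattern has a single capture group; with several groups `re.findall` returns tuples, which would need to be flattened first.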