Remove multi-word substring from string if substring in list in data frame column

Question

This is a follow-up to my earlier question: Remove substring from string if substring in list in data frame column

I have the following data frame df1

       string             lists
0      I HAVE A PET DOG   ['fox', 'pet dog', 'cat']
1      there is a cat     ['dog', 'house', 'car']
2      hello EVERYONE     ['hi', 'hello', 'everyone']
3      hi my name is Joe  ['name', 'was', 'is Joe']

I'm trying to return a data frame df2 that looks like this

       string             lists                         new_string
0      I HAVE A PET DOG   ['fox', 'pet dog', 'cat']     I HAVE A
1      there is a cat     ['dog', 'house', 'car']       there is a cat
2      hello EVERYONE     ['hi', 'hello', 'everyone']   
3      hi my name is Joe  ['name', 'was', 'is Joe']     hi my

The solution I was using does not work when a substring consists of multiple words, such as 'pet dog' or 'is Joe':

df['new_string'] = df['string'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in df['lists'][df['string'] == x].values[0]]))

score 1 · Accepted Answer · answered Sep 19 '22 at 20:58

This question looks similar to the previous one, but the solution is quite different.

In this case we use re.sub over the row axis (axis=1):

import re
df["new_string"] = df.apply(lambda row: re.sub("|".join(row["lists"]), "", row["string"], flags=re.I), axis=1)

              string                  lists      new_string
0   I HAVE A PET DOG    [fox, pet dog, cat]       I HAVE A 
1     there is a cat      [dog, house, car]  there is a cat
2     hello EVERYONE  [hi, hello, everyone]                
3  hi my name is Joe    [name, was, is Joe]         hi my

To break it down:

df.apply with axis=1 applies a function to each row
re.sub is the regex variant of str.replace
We use "|".join to build a "|"-separated string; "|" acts as the or operator in regex, so the pattern matches (and removes) any of the list items.
flags=re.I makes the match case-insensitive.
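
The steps above can be followed on a single row without pandas. This is only a sketch using the first row of the question's data; the final whitespace cleanup is an extra step that is not part of the answer, added because the regex leaves a trailing space behind.

```python
import re

# One row from the question's data frame, written out as plain Python values.
string = "I HAVE A PET DOG"
substrings = ["fox", "pet dog", "cat"]

# "|".join builds an alternation: the pattern matches any one of the items.
pattern = "|".join(substrings)
print(pattern)  # fox|pet dog|cat

# re.I makes the match case-insensitive, so "pet dog" also matches "PET DOG".
removed = re.sub(pattern, "", string, flags=re.I)
print(repr(removed))  # 'I HAVE A '

# Optional cleanup (an addition, not part of the answer): collapse leftover whitespace.
new_string = " ".join(removed.split())
print(repr(new_string))  # 'I HAVE A'
```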

Note: since we use apply over the row axis, this is basically a loop in the background and thus not very optimized.
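
If the row-wise apply turns out to be too slow, one possible alternative is to loop over the two columns directly with zip. It is still a Python loop, but it skips building a row object for every row; I have not benchmarked it, so treat it as a sketch. The helper name `remove_listed` is made up for this example, and the escaping and longest-first ordering are extra assumptions on top of the answer.

```python
import re

def remove_listed(strings, lists):
    """Remove each row's listed substrings from its string, ignoring case."""
    results = []
    for text, items in zip(strings, lists):
        # Longest items first, so "pet dog" is tried before "pet" (an assumption
        # about the desired behaviour when list entries overlap).
        ordered = sorted(items, key=len, reverse=True)
        pattern = "|".join(re.escape(item) for item in ordered)
        # An empty list gives an empty pattern; skip it and keep the text as is.
        results.append(re.sub(pattern, "", text, flags=re.I) if pattern else text)
    return results

strings = ["I HAVE A PET DOG", "there is a cat"]
lists = [["fox", "pet dog", "cat"], ["dog", "house", "car"]]
print(remove_listed(strings, lists))  # ['I HAVE A ', 'there is a cat']

# With the data frame from the question this would be used as:
# df["new_string"] = remove_listed(df["string"], df["lists"])
```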

If any of the strings in `row["lists"]` might contain special characters (for example because it is sourced from user input) you need to escape them, like so: `"|".join(re.escape(item) for item in row["lists"])`. Otherwise the regular expression will not work as expected. — Jasmijn, Sep 19 '22 at 21:03
this worked perfectly! I'm working with a pretty large dataset so I might need to think of a different way to format the df to make this more efficient — mjp, Sep 19 '22 at 21:15
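
To illustrate the escaping point from the first comment, here is a small sketch. The text and list entries are invented for the example. Without re.escape, the "." in "1.5" matches any character, and an entry with an unbalanced "(" is not a valid pattern at all.

```python
import re

text = "version 1.5 and build 125"
items = ["1.5"]  # hypothetical list entry containing "."

# Unescaped: "." matches any character, so "125" is removed as well.
unescaped = re.sub("|".join(items), "", text, flags=re.I)
# Escaped: only the literal "1.5" is removed.
escaped = re.sub("|".join(re.escape(item) for item in items), "", text, flags=re.I)

print(repr(unescaped))  # 'version  and build '
print(repr(escaped))    # 'version  and build 125'

# An unbalanced "(" makes the unescaped pattern invalid.
try:
    re.sub("|".join(["smile :("]), "", "a smile :( here")
    raised = False
except re.error:
    raised = True
print(raised)  # True
```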
