Remove words that appear in other column, Pandas

Question

How can I remove from a string in one column the word(s) that occur in another column?

E.g.:

Sr  A     B                       C
1   jack  jack and jill           and jill
2   run   you should run,         you should ,
3   fly   you shouldnt fly,there  you shouldnt ,there

As shown, I want column C to be B minus the contents of A. Note the third example, where "fly" is followed by a comma: the solution should also handle punctuation (in case the code relies on detecting spaces around the word).
Column A can also contain two words, in which case both need to be removed.
I need an expression in pandas, something like:

df.apply(lambda x: x["C"].replace(r"\b"+x["A"]+r"\b", "").strip(), axis=1)

Will column A always contain a single word? If it has more words, do we have to find the exact string as a match in column B? Or could it be any permutation of the words? – Alagappan Ramu, Mar 28 '14 at 13:07
Exact match if there are two words: "fly there" in A should match "fly there" in B and be removed. – Hypothetical Ninja, Mar 28 '14 at 13:08
I had the same problem and these answers weren't working for me (I got a "bad escape" error), but [this answer](https://stackoverflow.com/a/54892990/10419356) worked. – Rhines, Nov 02 '20 at 20:12

score 5 · Answer 1 · answered Mar 28 '14 at 13:06

How does this look?

In [24]: df
Out[24]: 
   Sr     A                       B
0   1  jack           jack and jill
1   2   run         you should run,
2   3   fly  you shouldnt fly,there

[3 rows x 3 columns]

In [25]: df.apply(lambda row: row.B.strip(row.A), axis=1)
Out[25]: 
0                 and jill
1          you should run,
2    ou shouldnt fly,there
dtype: object

TomAugspurger

Should it be used like this: df['C'] = your expression? – Hypothetical Ninja Mar 28 '14 at 13:09
It looks as if this expression works character by character. If there is a word such as "lynch" and it is compared with "fly", it removes "ly" from "lynch". I do not want that; maybe something like a word boundary would help. – Hypothetical Ninja Mar 28 '14 at 13:23
Yeah, you'll need to use a regex probably. Also to catch the punctuation correctly. I'll look at it again later. – TomAugspurger Mar 28 '14 at 13:31
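
To illustrate the behaviour described in the comments above: str.strip treats its argument as a set of characters to trim from both ends, not as a substring. A minimal plain-Python sketch (the variable names are just for illustration):

```python
# str.strip(chars) trims any leading/trailing characters that are in the
# set {"f", "l", "y"}; it does not look for the substring "fly".
stripped_sentence = "you shouldnt fly,there".strip("fly")
stripped_word = "lynch".strip("fly")

print(repr(stripped_sentence))  # the leading "y" is trimmed, "fly" in the middle stays
print(repr(stripped_word))      # the leading "l" and "y" are trimmed
```

This is why the third row above loses its leading "y" while "fly" itself is left in place.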

score 3 · Accepted Answer · edited May 23 '17 at 12:07

Try this:

x['C'] = x['B'].replace(to_replace=r'\b'+x['A']+r'\b', value='',regex=True)

It is based on a previous answer, where someone showed me exactly how to do this in pandas. I changed it a little to suit the current situation :)
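
For comparison, here is a minimal row-wise sketch of the same word-boundary idea. It is not the expression from this answer: it assumes a DataFrame named df with the question's columns, uses a hypothetical helper called remove_phrase, and wraps A in re.escape so that any regex-special characters in A are matched literally.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Sr": [1, 2, 3],
    "A": ["jack", "run", "fly"],
    "B": ["jack and jill", "you should run,", "you shouldnt fly,there"],
})


def remove_phrase(row):
    """Remove the contents of A from B for a single row (hypothetical helper)."""
    # \b anchors the match at word boundaries, so "fly" does not touch "lynch";
    # re.escape makes the text in A match literally.
    pattern = r"\b" + re.escape(row["A"]) + r"\b"
    # Drop every match, then trim whitespace left at the ends.
    return re.sub(pattern, "", row["B"]).strip()


df["C"] = df.apply(remove_phrase, axis=1)
print(df)
```

Because the pattern is built per row, each value in A is only removed from the B value in the same row. Note that removing a phrase from the middle of a string leaves the spaces on both sides in place.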

answered Mar 29 '14 at 07:20

Jerry