I have a DataFrame that looks like this:
import pandas as pd
import numpy as np

rows = [
    ('chocolate', 'choco'),
    ('banana', np.nan),
    ('hello world', 'world'),
    ('hello you', 'world'),
    ('hello you choco', 'world'),
    ('this is a very long sentence', 'very long')
]
data = pd.DataFrame.from_records(rows, columns=['origin', 'to_find'])
                          origin    to_find
0                      chocolate      choco
1                         banana        NaN
2                    hello world      world
3                      hello you      world
4                hello you choco      world
5  this is a very long sentence  very long
My goal is to search for the second column's string within the first column and remove it. If I do not find the substring to_find within origin, I replace to_find with NaN. Because this is a string operation that has to be done row by row, I chose the apply route. I need to extract entire words.

This is my function, which works almost* as expected, and how I apply it:
def find_word(row):
    # Handle the case where to_find is already NaN
    if row.to_find is np.nan:
        return row
    # Remove to_find from origin when it is a substring; otherwise blank out to_find
    if row.to_find in row.origin:
        row.origin = row.origin.replace(row.to_find, '').strip()
    else:
        row.to_find = np.nan
    return row
new_df = data.apply(find_word, axis=1)
* This code returns two spaces instead of one between "this is a" and "sentence", which is not desirable.
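(I can presumably get rid of the doubled space by splitting and rejoining on whitespace instead of calling strip() — a sketch of that variant is below, with find_word_clean being just a name I made up — but that does not help with the speed:)

def find_word_clean(row):
    if row.to_find is np.nan:
        return row
    if row.to_find in row.origin:
        # split()/join() collapses any run of whitespace to a single space
        row.origin = ' '.join(row.origin.replace(row.to_find, '').split())
    else:
        row.to_find = np.nan
    return row

new_df = data.apply(find_word_clean, axis=1)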
The new_df is expected to look like this:
               origin    to_find
0                late      choco
1              banana        NaN
2               hello      world
3           hello you        NaN
4     hello you choco        NaN
5  this is a sentence  very long
My problem is that my original df has millions of rows, and this particular operation takes forever on such a huge DataFrame.

What would be a more performant, maybe vectorized, way to tackle this?
(The .str.contains method seems to work only for finding one specific string across a whole column, not pairwise row by row. It was my best lead, but I couldn't make it work.)
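To illustrate what I mean: .str.contains takes a single pattern, so something like the first line below checks every origin against one fixed string ('world' is just an example), not against that row's own to_find; the only pairwise version I can come up with is a plain Python loop, which is not vectorized either:

# checks every row of origin against one fixed pattern -- not pairwise
mask = data['origin'].str.contains('world', na=False)

# pairwise check, but this is just a Python-level loop again
mask_pairwise = [t in o if isinstance(t, str) else False
                 for o, t in zip(data['origin'], data['to_find'])]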