I have a DataFrame that looks like this:

import numpy as np
import pandas as pd

rows = [
    ('chocolate', 'choco'),
    ('banana', np.nan),
    ('hello world', 'world'),
    ('hello you', 'world'),
    ('hello you choco', 'world'),
    ('this is a very long sentence', 'very long')
]
data = pd.DataFrame.from_records(rows, columns=['origin', 'to_find'])
                         origin    to_find
0                     chocolate      choco
1                        banana        NaN
2                   hello world      world
3                     hello you      world
4               hello you choco      world
5  this is a very long sentence  very long

My goal is to search for the second column's string within the first column and remove it from there. If the substring to_find is not found within origin, I replace to_find with NaN. Because this is a string operation that has to be done row by row, I went with apply. Note that I need to match entire words.

This is my function, that works almost* as expected, and how I apply it:

def find_word(row):
    # Handle the case where to_find is already NaN
    if pd.isna(row.to_find):
        return row

    if row.to_find in row.origin:
        row.origin = row.origin.replace(row.to_find, '').strip()
    else:
        row.to_find = np.nan

    return row

new_df = data.apply(find_word, axis=1)

* this code leaves two spaces instead of one between `this is a` and `sentence`, which is not desirable.
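
A minimal fix for that, assuming collapsing every run of whitespace is acceptable here, would be to rebuild the string from its words instead of relying on `.strip()`:

# split() with no argument splits on any run of whitespace, so joining
# the parts back collapses the double space left behind by str.replace
row.origin = ' '.join(row.origin.replace(row.to_find, '').split())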

The new_df is expected to look like this:

               origin    to_find
0                late      choco
1              banana        NaN
2               hello      world
3           hello you        NaN
4     hello you choco        NaN
5  this is a sentence  very long

My problem is that my real DataFrame has millions of rows, and this particular operation takes forever at that scale.

What would be a more performant, maybe vectorized way to tackle this?

(The `.str.contains` method seems to work only for finding one specific string in a whole column, not pairwise. It was my best lead, but I couldn't make it work.)
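
To illustrate: `.str.contains` broadcasts a single pattern over the whole column, so a pairwise check seems to require going back to a Python loop anyway (a sketch of what I tried, reusing the `data` DataFrame from above):

# One pattern against the whole column: this works...
data['origin'].str.contains('world')

# ...but there is no pairwise variant, so the fallback is a Python loop
# (the isinstance check skips the NaN in to_find)
[isinstance(t, str) and t in o for o, t in zip(data['origin'], data['to_find'])]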

pierre_loic
  • https://stackoverflow.com/questions/54828801/how-to-subtract-string-type-columns-values-from-another-column-in-pandas – sygneto Aug 07 '20 at 11:51
  • The thread is interesting for performance purposes. The list comprehension does not work with NaN in the `to_find` column (I could replace all NaNs with a dummy string). But also, I don't see how the `to_find` column can be updated at the same time – pierre_loic Aug 07 '20 at 11:58
  • @pierre_loic Maybe use `multiprocessing`. – Shubham Sharma Aug 07 '20 at 12:06
  • @pierre_loic But you are also replacing the `to_find` values with NaN where nothing is removed from the `origin` column? – sygneto Aug 07 '20 at 12:14

2 Answers


This solution should work both ways, if you also want to replace `origin` within `to_find`. It keeps the original form of the `origin` column as `temp_origin`, but your expected output makes no sense to me in the last line, where `to_find` becomes NaN.

import numpy as np
import pandas as pd

rows = [
    ('chocolate', 'choco'),
    ('banana', np.nan),
    ('hello world', 'world'),
    ('hello you', 'world')
]
df = pd.DataFrame.from_records(rows, columns=['origin', 'to_find'])
    
# Use empty strings instead of NaN so str.replace does not fail
df = df.fillna('')
df['temp_origin'] = df['origin']

# Remove to_find from origin
df["origin"] = df.apply(
    lambda x: x["origin"].replace(x["to_find"], ""), axis=1
)

# Symmetrically, remove the original origin from to_find
df["to_find"] = df.apply(
    lambda x: x["to_find"].replace(x["temp_origin"], ""), axis=1
)

# Put NaN back for empty strings and drop the helper column
df = df.replace('', np.nan)
del df['temp_origin']

print(df)
      origin to_find
0       late   choco
1     banana     NaN
2     hello    world
3  hello you   world
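
Note that `str.replace` matches plain substrings, which is why `chocolate` becomes `late` here. If only entire words should match, as the question asks, one option (a sketch, not part of this answer; the helper name is made up) is a regex with word boundaries:

import re

def remove_whole_words(origin, to_find):
    # \b ... \b only matches to_find as whole words; re.escape guards
    # against regex metacharacters, and the join collapses leftover spaces
    cleaned = re.sub(r'\b{}\b'.format(re.escape(to_find)), '', origin)
    return ' '.join(cleaned.split())

df['origin'] = df.apply(lambda x: remove_whole_words(x['origin'], x['to_find']), axis=1)
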
sygneto
  • Thank you @sygneto! Your method does not give the exact expected results, but with some tweaks I managed to make it work. And it's faster. I am going to post my detailed results. – pierre_loic Aug 11 '20 at 08:53

Update

Reading this thread and this one, I managed to reduce the processing time dramatically using list comprehensions. Here comes method_3:

def method_3(df):
    df["to_find"] = df["to_find"].fillna('')
    df['temp_origin'] = df['origin'].copy()

    # Keep only the words of origin that are not among the words of to_find
    df['origin'] = [' '.join([x for x in a.split() if x not in set(b.split())])
                    for a, b in zip(df['origin'], df['to_find'])]

    # The words that disappeared from origin are exactly the ones that were
    # found; rebuild them (an empty string, i.e. nothing found, becomes NaN)
    df['temp_origin'] = [' '.join([x for x in a.split() if x not in set(b.split())])
                         for a, b in zip(df['temp_origin'], df['origin'])]
    df['temp_origin'] = df['temp_origin'].replace('', np.nan)

    del df['to_find']
    df.rename(columns={'temp_origin': 'to_find'}, inplace=True)

    return df
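
For reference, applied to the sample data from the question (with whole-word matching, `choco` is no longer found inside `chocolate`, so row 0 keeps its origin and gets NaN in `to_find`, unlike the substring-based expected output in the question):

new_df = method_3(data.copy())
print(new_df)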

Now with new timings:

Method 1 took 13.820100281387568 sec.
Method 2 took 2.89176794141531 sec.
Method 3 took 0.26977075077593327 sec.

All three approaches are O(n), but method_3 is up to 50 times faster.

Original post

Largely inspired by @sygneto's answer, I managed to make it almost 5 times faster.

Two different methods

I put my first method in a function called method_1 and the other in method_2:

def find_word(row):
    # Handle the case where to_find is already NaN
    if pd.isna(row.to_find):
        return row

    if row.to_find in row.origin:
        row.origin = row.origin.replace(row.to_find, '').strip()
    else:
        row.to_find = np.nan

    return row

def method_1(df):
    return df.apply(find_word, axis=1)

def method_2(df):
    df = df.fillna('')
    df['temp_origin'] = df['origin']

    df["origin"] = df.apply(lambda x: x["origin"].replace(x["to_find"], ""), axis=1)
    # If origin did not change, to_find was not found: set it back to NaN
    df["to_find"] = df.apply(lambda x: np.nan if x["origin"] == x["temp_origin"] else x["to_find"], axis=1)

    del df['temp_origin']
    return df

Measure speed for both methods

To compare the time spent, I took my initial DataFrame and concatenated it 10,000 times:

from timeit import default_timer

df = pd.concat([data] * 10000)

t0 = default_timer()
new_df_1 = method_1(df)
t1 = default_timer()

# Rebuild a fresh DataFrame so method_2 starts from the same data
df = pd.concat([data] * 10000)

t2 = default_timer()
new_df_2 = method_2(df)
t3 = default_timer()

print(f"Method 1 took {t1-t0} sec.")
print(f"Method 2 took {t3-t2} sec.")

which outputs:

Method 1 took 11.803373152390122 sec.
Method 2 took 2.362371975556016 sec.

There is probably still room for improvement, but a big step has already been taken.
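
One further idea (a sketch, not benchmarked; it assumes numpy is imported as np) would be to build both columns in a single pass instead of two list comprehensions:

def method_4(df):
    origins, founds = [], []
    for a, b in zip(df['origin'], df['to_find'].fillna('')):
        to_remove = set(b.split())
        words = a.split()
        # Words kept in origin, and words removed (i.e. the ones found)
        origins.append(' '.join(x for x in words if x not in to_remove))
        removed = [x for x in words if x in to_remove]
        founds.append(' '.join(removed) if removed else np.nan)
    df['origin'], df['to_find'] = origins, founds
    return df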

pierre_loic