I have:
A location list of about 40k bigram/trigram words.
['San Francisco CA', 'Oakland CA', 'San Diego CA',...]
A Pandas DataFrame with millions of rows.
| string_column           | string_column_location_removed |
|-------------------------|--------------------------------|
| Burger King Oakland CA  | Burger King                    |
| Walmart Walnut Creek CA | Walmart                        |
I'm currently looping through the list of locations; if a location exists in `string_column`, I create a new column `string_column_location_removed` with the location removed.
Here is my attempt. While it works, it's very slow. Any ideas on how to speed it up?
I've tried taking ideas from this and this, but I'm not sure how to apply them to a Pandas DataFrame.
```python
from random import choice
from string import ascii_lowercase, digits
import re

import pandas as pd

# Making a random lookup list here, plus two real locations
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')

strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
                  "Random Other Thing Here", "Another random other thing here",
                  "Really Appreciate the help on this", "Thank you so Much!"] * 250000
df = pd.DataFrame({'string_column': strings_for_df})

def location_remove(txnString):
    for locationString in locations_lookup_list:
        if re.search(f'\\b{locationString}\\b', txnString):
            return re.sub(f'\\b{locationString}\\b', '', txnString)
    return txnString  # no location matched; keep the original string

df['string_column_location_removed'] = df['string_column'].apply(location_remove)
```
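For context, one direction I've been considering (a sketch, not a confirmed solution): instead of scanning each row against up to 40k patterns, join all the locations into a single compiled alternation regex and let pandas' vectorized `str.replace` make one pass per string. The locations and expected outputs below are just illustrative samples:

```python
import re
import pandas as pd

locations = ['Walnut Creek CA', 'Oakland CA', 'San Francisco CA']

# Sort longest-first so longer locations win over any shorter prefixes,
# and re.escape each one in case a location contains regex metacharacters.
pattern = re.compile(
    r'\b(?:'
    + '|'.join(re.escape(loc) for loc in sorted(locations, key=len, reverse=True))
    + r')\b'
)

df = pd.DataFrame({'string_column': [
    'Burger King Oakland CA',
    'Walmart Walnut Creek CA',
    'Random Other Thing Here',
]})

# One regex pass per string; strip the whitespace left behind by the removal.
df['string_column_location_removed'] = (
    df['string_column'].str.replace(pattern, '', regex=True).str.strip()
)
```

With a 40k-entry alternation the compiled pattern gets large, so this may still need profiling, but it at least avoids the Python-level loop over the whole lookup list for every row.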