I have:
A location list of about 40k bigram/trigram words.
['San Francisco CA', 'Oakland CA', 'San Diego CA',...]
A Pandas DataFrame with millions of rows.
| string_column           | string_column_location_removed |
|-------------------------|--------------------------------|
| Burger King Oakland CA  | Burger King                    |
| Walmart Walnut Creek CA | Walmart                        |
I'm currently looping through the list of locations; if a location exists in `string_column`, I create a new column `string_column_location_removed` with the location removed.
Here is my attempt. While it works, it's very slow. Any ideas on how to speed it up?
I've tried taking ideas from this and this, but I'm not sure how to apply them to a Pandas DataFrame.
```python
from random import choice
from string import ascii_lowercase, digits
import re

import pandas as pd

# Making a random lookup list here, plus two real locations
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')

strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
                  "Random Other Thing Here", "Another random other thing here",
                  "Really Appreciate the help on this", "Thank you so Much!"] * 250000
df = pd.DataFrame({'string_column': strings_for_df})

def location_remove(txnString):
    for locationString in locations_lookup_list:
        if re.search(f'\\b{locationString}\\b', txnString):
            return re.sub(f'\\b{locationString}\\b', '', txnString)
    return txnString  # no location matched; keep the original string

df['string_column_location_removed'] = df['string_column'].apply(location_remove)
```
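For context, one direction I've been considering (a sketch, not a confirmed solution): instead of scanning each row against up to 40k patterns, join all the locations into a single compiled alternation regex and let pandas' vectorized `str.replace` make one pass per string. The locations and expected outputs below are just illustrative samples:

```python
import re
import pandas as pd

locations = ['Walnut Creek CA', 'Oakland CA', 'San Francisco CA']

# Sort longest-first so longer locations win over any shorter prefixes,
# and re.escape each one in case a location contains regex metacharacters.
pattern = re.compile(
    r'\b(?:'
    + '|'.join(re.escape(loc) for loc in sorted(locations, key=len, reverse=True))
    + r')\b'
)

df = pd.DataFrame({'string_column': [
    'Burger King Oakland CA',
    'Walmart Walnut Creek CA',
    'Random Other Thing Here',
]})

# One regex pass per string; strip the whitespace left behind by the removal.
df['string_column_location_removed'] = (
    df['string_column'].str.replace(pattern, '', regex=True).str.strip()
)
```

With a 40k-entry alternation the compiled pattern gets large, so this may still need profiling, but it at least avoids the Python-level loop over the whole lookup list for every row.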