Pandas: Remove all words from specific list within dataframe strings in large dataset

Question

So I have three pandas dataframes (train_original, train_augmented, test), about 700k rows in total. I would like to remove every city in a list, common_cities, from the strings they contain. But tqdm in the notebook cell estimates that replacing all 33,000 cities would take about 24 hours.

dataframe example (train_original):

id	name_1	name_2
0	sun blinds decoration paris inc.	indl de cuautitlan sa cv
1	eih ltd. dongguan wei shi	plastic new york product co., ltd.
2	jsh ltd. (hk) mexico city	arab shipbuilding seoul and repair yard madrid c

common_cities list example

common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']

expected output:

id	name_1	name_2
0	sun blinds decoration inc.	indl de sa cv
1	eih ltd. wei shi	plastic product co., ltd.
2	jsh ltd. (hk)	arab shipbuilding and repair yard c

My solution worked well with a small list of filter words, but performance is poor when the list is large.

%%time

for city in tqdm(common_cities):
    train_original.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    train_augmented.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
    test.replace(re.compile(fr'\b({city})\b'), '', inplace=True)

P.S.: I presume that splitting each string and filtering the words in a list comprehension is not a good fit, because a city name can consist of two or more words.

Any suggestions or ideas for an approach that makes such replacements fast on pandas DataFrames?

Welcome to Stack Overflow. Please edit your question to include a minimal reproducible example, including samples of your input data and expected output, so that we can understand how to help you. See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for formatting help — G. Anderson, Dec 10 '20 at 19:36
Would be helpful to see a sample of your dataframe. Do you have a column named "cities" and in cell values are cities as string, or perhaps a list of cities? It changes the answer dramatically. — itaishz, Dec 10 '20 at 19:37
One place to start with optimization is to look at where you're repeating code. For example, you could pre-compile `re.compile(fr'\b({city})\b')` once instead of three times in each loop, or even compile all your cities into one regex pattern instead of looping. You could also make use of the built-in functions to pass multiple replacement items in one go instead of iterating — G. Anderson, Dec 10 '20 at 19:43

score 2 · Answer 1 · answered Dec 10 '20 at 19:42

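Instead of iterating over the large dataframes once per city, remember that pandas replace accepts a dictionary holding all the replacements, so they can be done in a single call. Note that replace matches whole cell values by default; to remove cities that appear inside longer strings, the keys have to be regular expressions and regex=True has to be passed.

Therefore we can start by creating the dictionary and then using it with replace:

import re

# Map each city, escaped and wrapped in word boundaries, to the empty string
replacements = {rf'\b{re.escape(x)}\b': '' for x in common_cities}
train_original = train_original.replace(replacements, regex=True)
train_augmented = train_augmented.replace(replacements, regex=True)
test = test.replace(replacements, regex=True)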
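
One thing worth checking on a small sample before running this on 700k rows: by default, DataFrame.replace compares whole cell values, so a city embedded in a longer string is only removed when regex=True is passed. A minimal sketch of the difference; the two-row frame df is made up for illustration and is not the asker's data:

```python
import re
import pandas as pd

# Tiny illustrative frame (hypothetical data, not the asker's dataset).
df = pd.DataFrame({"name_1": ["madrid", "repair yard madrid c"]})

# Literal matching: only the cell that is exactly "madrid" changes.
literal = df.replace({"madrid": ""})

# Regex matching: the city is also removed inside longer strings.
pattern = rf"\b{re.escape('madrid')}\b"
by_regex = df.replace({pattern: ""}, regex=True)
```

With literal matching the second row is left untouched, while the regex version removes the city from both rows (leaving a double space behind in the second one).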

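Edit: Reading the documentation, it might be even easier, because replace also accepts a list of values to be replaced (again as regular expressions with regex=True, so that cities inside longer strings are matched):

patterns = [rf'\b{re.escape(x)}\b' for x in common_cities]
train_original = train_original.replace(patterns, '', regex=True)
train_augmented = train_augmented.replace(patterns, '', regex=True)
test = test.replace(patterns, '', regex=True)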
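
The comment above about compiling all cities into one regex pattern can be sketched as follows. This is only a sketch under assumptions: the helper names build_city_pattern and remove_cities are hypothetical, the sample frame reuses rows from the question, and whether one alternation of 33,000 names is fast enough on the real data would have to be measured.

```python
import re
import pandas as pd

def build_city_pattern(cities):
    """Compile one alternation pattern matching any city as a whole word or phrase."""
    # Longest names first, so multi-word cities are tried before shorter ones.
    ordered = sorted(set(cities), key=len, reverse=True)
    alternation = "|".join(re.escape(c) for c in ordered)
    return re.compile(rf"\b(?:{alternation})\b")

def remove_cities(df, columns, pattern):
    """Return a copy of df with pattern matches removed from the given string columns."""
    out = df.copy()
    for col in columns:
        cleaned = out[col].str.replace(pattern, "", regex=True)
        # Collapse the extra whitespace left behind by removed names.
        out[col] = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()
    return out

common_cities = ["moscow", "madrid", "san francisco", "mexico city"]
train_original = pd.DataFrame({
    "name_1": ["jsh ltd. (hk) mexico city", "eih ltd. dongguan wei shi"],
    "name_2": ["arab shipbuilding and repair yard madrid c", "plastic product co., ltd."],
})

city_pattern = build_city_pattern(common_cities)
cleaned = remove_cities(train_original, ["name_1", "name_2"], city_pattern)
```

The pattern is compiled once and each column is scanned a single time, instead of once per city. The same city_pattern can then be reused for train_augmented and test.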
