I have three pandas DataFrames (train_original, train_augmented, test), about 700k rows in total. I would like to remove all cities that appear in a list of common cities, common_cities. But tqdm in the notebook cell suggests it would take about 24 hours to run the replacement for a list of 33,000 cities.
DataFrame example (train_original):
id | name_1 | name_2 |
---|---|---|
0 | sun blinds decoration paris inc. | indl de cuautitlan sa cv |
1 | eih ltd. dongguan wei shi | plastic new york product co., ltd. |
2 | jsh ltd. (hk) mexico city | arab shipbuilding seoul and repair yard madrid c |
common_cities list example:
common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']
Expected output:
id | name_1 | name_2 |
---|---|---|
0 | sun blinds decoration inc. | indl de sa cv |
1 | eih ltd. wei shi | plastic product co., ltd. |
2 | jsh ltd. (hk) | arab shipbuilding and repair yard c |
My solution worked well with a small list of filter words, but with a large list the performance is poor.
```python
%%time
import re
from tqdm import tqdm

for city in tqdm(common_cities):
    # re.escape guards against cities containing regex metacharacters
    pattern = re.compile(fr'\b{re.escape(city)}\b')
    train_original.replace(pattern, '', inplace=True)
    train_augmented.replace(pattern, '', inplace=True)
    test.replace(pattern, '', inplace=True)
```
P.S.: I presume a list comprehension that splits each string and substitutes city names won't work here, because a city name can be more than one word.
Any suggestions or ideas on how to make replacements on pandas DataFrames fast in situations like this?
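For context, one direction I've been considering is collapsing the per-city loop into a single combined alternation regex applied once per column, instead of 33,000 separate passes over the data. A minimal sketch on a toy DataFrame (the `df` and column names here are stand-ins for `train_original` and its columns):

```python
import re
import pandas as pd

common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']

# One alternation pattern for all cities; sort longest-first so that
# multi-word names like 'mexico city' match before any shorter overlap.
pattern = re.compile(
    r'\b(?:'
    + '|'.join(map(re.escape, sorted(common_cities, key=len, reverse=True)))
    + r')\b'
)

df = pd.DataFrame({
    'name_1': ['sun blinds decoration paris inc.', 'jsh ltd. (hk) mexico city'],
    'name_2': ['indl de cuautitlan sa cv',
               'arab shipbuilding seoul and repair yard madrid c'],
})

# One vectorized pass per column instead of one pass per city;
# then collapse the double spaces left behind and trim the ends.
for col in ['name_1', 'name_2']:
    df[col] = (df[col]
               .str.replace(pattern, '', regex=True)
               .str.replace(r'\s{2,}', ' ', regex=True)
               .str.strip())
```

I'm not sure how well a 33,000-branch alternation scales, but it turns O(cities × rows) scans into O(rows), so it seems like the obvious thing to try first.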