
I have three pandas DataFrames (train_original, train_augmented, test), about 700k rows in total. I would like to remove every city in a list, common_cities, from them. But tqdm in a notebook cell estimates that replacing all 33,000 cities in the list would take about 24 hours.

dataframe example (train_original):

id name_1 name_2
0 sun blinds decoration paris inc. indl de cuautitlan sa cv
1 eih ltd. dongguan wei shi plastic new york product co., ltd.
2 jsh ltd. (hk) mexico city arab shipbuilding seoul and repair yard madrid c

common_cities list example:

common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']

Expected output:

id name_1 name_2
0 sun blinds decoration inc. indl de sa cv
1 eih ltd. wei shi plastic product co., ltd.
2 jsh ltd. (hk) arab shipbuilding and repair yard c

My solution works well with a small list of filter words, but performance is poor when the list is large.

%%time
import re
from tqdm import tqdm

for city in tqdm(common_cities):
    # Compile once per city; escape in case a name contains regex metacharacters
    pattern = re.compile(fr'\b({re.escape(city)})\b')
    train_original.replace(pattern, '', inplace=True)
    train_augmented.replace(pattern, '', inplace=True)
    test.replace(pattern, '', inplace=True)

P.S. I presume that splitting each string into words and filtering them with a list comprehension is not a good fit here, because a city name can span more than one word.

Any suggestions or ideas for making this kind of replacement fast on pandas DataFrames?

patsvetov
    Welcome to stack overflow. Please [edit] your question to include a [mcve] including samples of your input data and expected output so that we can understand how to help you. See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for formatting help – G. Anderson Dec 10 '20 at 19:36
  • Would be helpful to see a sample of your dataframe. Do you have a column named "cities" and in cell values are cities as string, or perhaps a list of cities? It changes the answer dramatically. – itaishz Dec 10 '20 at 19:37
  • One place to start with optimization is to look at where you're repeating code. For example, you could pre-compile `re.compile(fr'\b({city})\b')` once instead of three times in each loop, or even compile all your cities into one regex pattern instead of looping. You could also make use of the built-in functions to pass multiple replacement items in one go instead of iterating – G. Anderson Dec 10 '20 at 19:43

1 Answer

Instead of iterating over the huge DataFrames once per city, remember that pandas replace accepts a dictionary holding all the replacements, so they can be done in a single call.

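Therefore we can start by creating the dictionary and then using it with replace. Note that without regex=True, replace only matches cells whose entire value equals a key, so to strip a city from inside a longer string the keys must be regex patterns:

import re

# Word-bounded, escaped patterns; regex=True makes replace match substrings
replacements = {fr'\b{re.escape(x)}\b': '' for x in common_cities}
train_original = train_original.replace(replacements, regex=True)
train_augmented = train_augmented.replace(replacements, regex=True)
test = test.replace(replacements, regex=True)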

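Edit: Reading the documentation, it might be even easier, because replace also accepts a list of values to be replaced (again with regex=True so that substrings are matched):

import re

# One word-bounded, escaped pattern per city
patterns = [fr'\b{re.escape(x)}\b' for x in common_cities]
train_original = train_original.replace(patterns, '', regex=True)
train_augmented = train_augmented.replace(patterns, '', regex=True)
test = test.replace(patterns, '', regex=True)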
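
As suggested in the comments on the question, the per-city patterns can also be collapsed into a single alternation, so each column is scanned once rather than once per city. The following is a minimal sketch, not a benchmarked solution: the helper names build_city_pattern and strip_cities, the extended city list, and the sample rows are assumptions made for illustration, and the whitespace cleanup is an extra step added to match the expected output shown in the question.

```python
import re
import pandas as pd


def build_city_pattern(cities):
    """Join all city names into one word-bounded alternation (hypothetical helper)."""
    # Longest names first, so a multi-word name is tried before any shorter
    # name that happens to be a prefix of it.
    ordered = sorted(set(cities), key=len, reverse=True)
    return r'\b(?:' + '|'.join(re.escape(c) for c in ordered) + r')\b'


def strip_cities(df, columns, pattern):
    """Return a copy of df with cities removed from the given string columns."""
    out = df.copy()
    for col in columns:
        out[col] = (
            out[col]
            .str.replace(pattern, '', regex=True)   # drop every city in one pass
            .str.replace(r'\s+', ' ', regex=True)   # collapse leftover double spaces
            .str.strip()
        )
    return out


# Sample data modelled on the question; the list is extended for the demo (assumption).
common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city',
                 'paris', 'new york', 'seoul']
train_original = pd.DataFrame({
    'name_1': ['sun blinds decoration paris inc.', 'jsh ltd. (hk) mexico city'],
    'name_2': ['indl de sa cv', 'arab shipbuilding seoul and repair yard madrid c'],
})

pattern = build_city_pattern(common_cities)
cleaned = strip_cities(train_original, ['name_1', 'name_2'], pattern)
print(cleaned)
```

The same pattern string can be reused for train_augmented and test. Whether one large alternation is fast enough for 33,000 cities depends on the regex engine and the data, so it is worth timing on a sample before running it on all 700k rows.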
Celius Stingher