In my pandas DataFrame, I have a column containing the user's location. I have written a function that identifies the country from the location, and I want to store the result in a new column. The function is:
from geopy.geocoders import Nominatim
import numpy as np
import pycountry                       # needed for the fuzzy country lookup below

# Nominatim's usage policy asks for an identifying user_agent; the string is arbitrary.
geolocator = Nominatim(user_agent="country-lookup")

def do_fuzzy_search(location):
    # Missing values arrive as float NaN; pass them straight through.
    if type(location) == float and np.isnan(location):
        return np.nan
    else:
        try:
            # Offline fuzzy match against pycountry's country list.
            result = pycountry.countries.search_fuzzy(location)
        except Exception:
            try:
                # Fall back to a live Nominatim request; the last component of
                # the display name is the country.
                loc = geolocator.geocode(str(location))
                return loc.raw['display_name'].split(', ')[-1]
            except Exception:
                return np.nan
        else:
            return result[0].name
On passing any location name, the function returns the name of the country. For example,
do_fuzzy_search("Bombay")
returns 'India'.
I simply want to create a new column using the apply function:
df['country'] = df.user_location.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
But it takes forever to run. I have tried a few techniques mentioned in other Stack Overflow questions and in blog posts on the same theme, such as Performance of Pandas apply vs np.vectorize, Optimizing Pandas Code for Speed, speeding up pandas with Dask or Swifter, and speeding up pandas with cuDF.
The time taken to execute just the first 10 rows of the column with each technique is as follows.
Plain pandas apply:
%%time
attractions.User_loc[:10].apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 27 ms, sys: 1.18 ms, total: 28.2 ms
Wall time: 6.59 s
0 United States of America
1 NaN
2 Australia
3 India
4 NaN
5 Australia
6 India
7 India
8 United Kingdom
9 Singapore
Name: User_loc, dtype: object
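The gap between 28.2 ms of CPU time and 6.59 s of wall time suggests almost all of the time is spent waiting on the Nominatim network calls rather than in pandas itself. A minimal timing sketch for a single lookup (my own check, not one of the benchmarks here):
import time

start = time.perf_counter()
do_fuzzy_search("Bombay")                    # may fall through to a live Nominatim HTTP request
elapsed = time.perf_counter() - start
print(f"single lookup: {elapsed:.2f} s")     # dominated by network latency when the geocoder is hit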
Using the Swifter library:
%%time
attractions.User_loc[:10].swifter.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 1.03 s, sys: 17.9 ms, total: 1.04 s
Wall time: 7.94 s
0 United States of America
1 NaN
2 Australia
3 India
4 NaN
5 Australia
6 India
7 India
8 United Kingdom
9 Singapore
Name: User_loc, dtype: object
Using np.vectorize:
%%time
np.vectorize(do_fuzzy_search)(attractions['User_loc'][:10])
CPU times: user 34.3 ms, sys: 3.13 ms, total: 37.4 ms
Wall time: 9.05 s
array(['United States of America', 'Italia', 'Australia', 'India',
'Italia', 'Australia', 'India', 'India', 'United Kingdom',
'Singapore'], dtype='<U24')
I also used Dask's map_partitions, which did not give much performance gain over the apply function:
import dask.dataframe as dd
import multiprocessing

dd.from_pandas(attractions.User_loc, npartitions=4 * multiprocessing.cpu_count()) \
  .map_partitions(lambda df: df.apply(lambda row: do_fuzzy_search(row) if pd.notnull(row) else row)) \
  .compute(scheduler='processes')
With every technique, the computation time for just the first 10 rows is more than 5 seconds, so it takes forever for 100k rows.
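A rough extrapolation from the wall times above (my own back-of-the-envelope arithmetic, not a measurement):
per_row = 6.59 / 10                     # ~0.66 s per row from the plain apply() timing above
total_hours = per_row * 100_000 / 3600
print(f"~{total_hours:.0f} hours for 100k rows at the same rate")   # roughly 18 hours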
I also tried cuDF, but it crashes my Colab notebook. What can I do to improve the performance and get the result in a reasonable time?