I am using fuzzywuzzy in Python for fuzzy string matching. I have a set of names in a list named HKCP_list, which I match against a pandas column row by row to find the best possible match for each name. The code is given below:
import fuzzywuzzy
from fuzzywuzzy import fuzz, process

def search_func(row):
    # Return the best (name, score) match for this name from HKCP_list
    chk = process.extract(row, HKCP_list, scorer=fuzz.token_sort_ratio)[0]
    return chk

wc_df['match'] = wc_df['concat_name'].map(search_func)
The wc_df dataframe contains the column 'concat_name' which needs to be matched with every name in the list HKCP_list. The above code took around 2 hours to run with 6K names in the list and 11K names in the column 'concat_name'.
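To give a sense of the scale involved, here is a rough back-of-the-envelope estimate. It assumes the runtime grows linearly with the number of name pairs compared, which is only an approximation; the larger sizes are those of the next data set described below.

```python
# Rough runtime estimate, assuming time scales linearly with the number of pairs.
small_pairs = 6_000 * 11_000      # list size x column size for the run that took ~2 hours
large_pairs = 89_000 * 120_000    # list size x column size for the next data set

scale = large_pairs / small_pairs
observed_hours = 2.0              # approximate runtime observed on the small data set
estimated_hours = observed_hours * scale

print(f"{small_pairs:,} pairs -> {large_pairs:,} pairs ({scale:.0f}x)")
print(f"estimated runtime: {estimated_hours:.0f} hours (~{estimated_hours / 24:.1f} days)")
```

That is roughly 160 times as many comparisons, so the brute-force approach is not practical for the larger data set.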
I have to rerun this on another data set where there are 89K names in the list and 120K names in the column. To speed up the process, I got an idea from the following question on Stack Overflow:
Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column
In one of the comments on an answer to the above question, it was advised to compare only names that share the same first letter. The 'concat_name' column that I am comparing against is a derived column, obtained by concatenating the 'first_name' and 'last_name' columns in the dataframe. Since I am using the token sort score, I compare the first letter of both first_name and last_name with the first letter of each element in the list. Given below is the code:
wc_df['first_name_1stletter'] = wc_df['first_name'].str[0]
wc_df['last_name_1stletter'] = wc_df['last_name'].str[0]
import time
start_time=time.time()
def match_func(row):
    # Keep only list names whose first letter equals the first-name or last-name initial
    CP_subset = [x for x in HKCP_list if x[0] == row['first_name_1stletter'] or x[0] == row['last_name_1stletter']]
    return CP_subset
wc_df['list_to_match']=wc_df.apply(match_func,axis=1)
end_time=time.time()
print(end_time-start_time)
The above step took 1600 seconds with the 6K x 11K data. The 'list_to_match' column contains the list of names to be compared against each concat_name. From here, I still have to take each row's list_to_match and run the fuzzy string matching on it using the process.extract method. Is there a more elegant and faster way of doing this in the same step as above?
PS: Editing this to add an example of what the list and the dataframe columns look like.
HKCP_list=['jeff bezs','michael blomberg','bill gtes','tim coook','elon musk']
concat_name=['jeff bezos','michael bloomberg','bill gates','tim cook','elon musk','donald trump','kim jong un', 'narendra modi','michael phelps']
first_name=['jeff','michael','bill','tim','elon','donald','kim','narendra','michael']
last_name=['bezos','bloomberg','gates','cook','musk','trump','jong un', 'modi','phelps']
import pandas as pd
df=pd.DataFrame({'first_name':first_name,'last_name':last_name,'concat_name':concat_name})
Each row of the 'concat_name' column in df has to be compared against the elements of HKCP_list.
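To make the "same step" idea concrete, here is a minimal sketch of what I have in mind, using the sample data above. It groups the list by first letter once, instead of rescanning the whole list for every row, and scores the candidates in the same pass. The names token_sort_score, by_first_letter and best_match are made up for illustration, and token_sort_score is only a standard-library stand-in (difflib) so the snippet runs without fuzzywuzzy; its scores will not necessarily equal those of fuzz.token_sort_ratio.

```python
from collections import defaultdict
from difflib import SequenceMatcher

HKCP_list = ['jeff bezs', 'michael blomberg', 'bill gtes', 'tim coook', 'elon musk']
first_name = ['jeff', 'michael', 'bill', 'tim', 'elon', 'donald', 'kim', 'narendra', 'michael']
last_name = ['bezos', 'bloomberg', 'gates', 'cook', 'musk', 'trump', 'jong un', 'modi', 'phelps']
concat_name = [f + ' ' + l for f, l in zip(first_name, last_name)]

def token_sort_score(a, b):
    """Hypothetical stand-in scorer: sort tokens, then compare with difflib (0-100)."""
    a_sorted = ' '.join(sorted(a.split()))
    b_sorted = ' '.join(sorted(b.split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())

# Build the first-letter index once, so each row only looks up its own buckets.
by_first_letter = defaultdict(list)
for name in HKCP_list:
    if name:
        by_first_letter[name[0]].append(name)

def best_match(first, last, full):
    """Return (best_name, score) among list names sharing an initial, or None."""
    letters = {first[:1], last[:1]}  # a set, so identical initials are not scanned twice
    candidates = [n for letter in letters for n in by_first_letter.get(letter, [])]
    if not candidates:
        return None
    scored = ((n, token_sort_score(full, n)) for n in candidates)
    return max(scored, key=lambda pair: pair[1])

matches = [best_match(f, l, c) for f, l, c in zip(first_name, last_name, concat_name)]
for name, match in zip(concat_name, matches):
    print(name, '->', match)
```

With the dataframe, the same list comprehension could presumably be run over zip(wc_df['first_name'], wc_df['last_name'], wc_df['concat_name']) and assigned to wc_df['match'], with fuzz.token_sort_ratio swapped in for the stand-in scorer. Note that names with no close counterpart (for example 'donald trump') still get a low-scoring match from their letter bucket, so a score cutoff would be needed on top of this.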
PS: editing today to reflect the ":" and the row in the 2nd snippet of code I missed yesterday
Regards, Nirvik