I have two large data frames (159,000 and 56,000 rows respectively). Both data frames look like this:

    Name  | MotherName | ID
    name1 | Mname1     | 1
    name2 | Mname2     | 2

Note that the two tables contain the same people, with ID being the unique key.
I need to apply this function to them:
    def sim_func(df1_name_list, df2_name_list):
        mini_result = pd.DataFrame(columns=['df1_name', 'df2_name'])
        for name in df1_name_list:
            flag = False
            for name2 in df2_name_list:
                # Both lists are sorted, so once we have seen the block of
                # names sharing the same first letter, we can stop early.
                if name[0] == name2[0]:
                    flag = True
                if flag and name[0] != name2[0]:
                    break
                if not flag and name[0] != name2[0]:
                    continue
                if similarity_between_names_function(name, name2) >= 0.85:
                    mini_result.loc[len(mini_result.index)] = [name, name2]
        return mini_result
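The similarity function itself is not shown above. As a stand-in so the example is runnable (this is my assumption for illustration, not my actual metric), something like difflib's ratio fits the same `>= 0.85` threshold:

```python
from difflib import SequenceMatcher

def similarity_between_names_function(name, name2):
    # Stand-in metric: character-overlap ratio in [0, 1].
    # My real function is different, but any score in [0, 1]
    # works with the >= 0.85 threshold used in sim_func.
    return SequenceMatcher(None, name, name2).ratio()
```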
As you can see, this function would normally take a ridiculous amount of time (I estimate about 40 hours) to run on the full length of both lists. Therefore, I attempted to split the big data frame into 17 chunks and run the function on all 17 at the same time using multiprocessing.
However, when I ran the code, it only returned 1 data frame, not 17.
I would be grateful if you could explain why.
For your convenience, the rest of the code:
    import pandas as pd
    from multiprocessing import Pool
    from time import time

    def chunks(lst, n):
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

    def create_chunks_list(list_of_names):
        list_of_chunks = []
        threads_to_use = 17  # I have a 16-core processor
        for each_chunk in chunks(list_of_names, int(len(list_of_names) / threads_to_use)):
            list_of_chunks.append(list(each_chunk))
        return list_of_chunks
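As a sanity check, the chunking generator behaves as expected on a small list (self-contained copy of the helper above):

```python
def chunks(lst, n):
    # Yield successive n-sized slices of lst; the last slice may be shorter.
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

print(list(chunks(['a', 'b', 'c', 'd', 'e'], 2)))
# [['a', 'b'], ['c', 'd'], ['e']]
```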
    def create_table(list_of_chunks, names2_list, threads=17):
        start_time = time()
        p = Pool(threads)
        result_dataframes = p.starmap(sim_func, zip(list_of_chunks, (names2_list,)))
        p.close()
        p.join()
        return pd.concat(result_dataframes)
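To narrow things down, I isolated the `zip` expression that feeds `starmap` and ran it with placeholder data (three tiny chunks instead of my real 17); even there it yields only a single argument pair:

```python
chunks_placeholder = [['ann'], ['bob'], ['carl']]  # stands in for my 17 chunks
names2_placeholder = ['anna', 'ben']               # stands in for names2_list

pairs = list(zip(chunks_placeholder, (names2_placeholder,)))
print(len(pairs))  # prints 1, even though there are 3 chunks
```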
    if __name__ == '__main__':
        names_list = df['Name'].drop_duplicates().sort_values().tolist()
        names2_list = df2['Name'].drop_duplicates().sort_values().tolist()
        list_of_names_chunks = create_chunks_list(names_list)
        final_df = create_table(list_of_names_chunks, names2_list)
Again, the problem is that it returns only a single data frame with multiprocessing, as opposed to 17 (in debug mode, the variable result_dataframes is a list of length 1).
I am sorry that I cannot add the original data, as it is confidential.
Thanks for taking the time to read my question. I know it is a long and difficult one compared to the average question posted on this site, so thanks for your time.
And, if by some miracle you manage to solve this, I salute you :)