
I have two large data frames (159,000 and 56,000 rows respectively). Both data frames look like this:

Name  | MotherName | ID
name1 | Mname1     | 1
name2 | Mname2     | 2

Note that the two tables contain the same people, with ID being the unique key.
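
Since I can't share the real data, here are mock frames of the same shape (the names are made up):

    import pandas as pd

    df = pd.DataFrame({'Name': ['name1', 'name2'],
                       'MotherName': ['Mname1', 'Mname2'],
                       'ID': [1, 2]})   # 159,000 rows in reality
    df2 = df.copy()                     # 56,000 rows in reality, same people by ID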

I need to apply this function to them:

    def sim_func(df1_name_list, df2_name_list):
        # Both name lists are sorted, so every candidate match for a name
        # sits in the contiguous block of names sharing its first letter.
        mini_result = pd.DataFrame(columns=['df1_name', 'df2_name'])
        for name in df1_name_list:
            flag = False
            for name2 in df2_name_list:
                if name[0] == name2[0]:
                    flag = True   # entered the block with the same first letter
                elif flag:
                    break         # left the block; no later name can match
                else:
                    continue      # haven't reached the block yet
                if similarity_between_names_function(name, name2) >= 0.85:
                    mini_result.loc[len(mini_result.index)] = [name, name2]
        return mini_result
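
(similarity_between_names_function is our internal scorer, which I also can't share; if you want to run the snippet, something like difflib's ratio, which likewise returns a value in [0, 1], can stand in for it:)

    from difflib import SequenceMatcher

    def similarity_between_names_function(name, name2):
        # Placeholder for the real (confidential) scorer; returns a ratio in [0, 1].
        return SequenceMatcher(None, name, name2).ratio()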

As you can see, this function would normally take a ridiculous amount of time (around 40 hours, by my estimate) to run on the full length of both lists. Therefore, I attempted to split the big data frame into 17 chunks and run the function on all of them at the same time using threading.

However, when I ran the code, it only returned 1 data frame, not 17....

I will be grateful if you could explain why.

For your convenience, the rest of the code:

    def chunks(lst, n):
        # Yield successive n-sized slices of lst.
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

    def create_chunks_list(list_of_names):
        list_of_chunks = []
        threads_to_use = 17  # I have a 16-core processor
        # Floor division means a short remainder chunk can bring the total to 18.
        for each_chunk in chunks(list_of_names, len(list_of_names) // threads_to_use):
            list_of_chunks.append(list(each_chunk))
        return list_of_chunks
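
A quick sanity check of the chunk helpers on a toy list (not the real names) behaves the way I expect:

    toy = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    print(list(chunks(toy, 3)))   # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]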


    from multiprocessing import Pool
    from time import time

    def create_table(list_of_chunks, names2_list, threads=17):
        start_time = time()
        p = Pool(threads)
        result_dataframes = p.starmap(sim_func, zip(list_of_chunks, (names2_list,)))
        p.close()
        p.join()
        return pd.concat(result_dataframes)


    if __name__ == '__main__':
        names_list = df['Name'].drop_duplicates().sort_values().tolist()
        names2_list = df2['Name'].drop_duplicates().sort_values().tolist()
        list_of_names_chunks = create_chunks_list(names_list)
        final_df = create_table(list_of_names_chunks, names2_list)

Again, the problem is that it returns only a single data frame with threading, as opposed to 17 (in debug mode, the variable result_dataframes is a list of length 1).
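
I can reproduce the same length-1 list in isolation with toy arguments (stand-ins for the real chunks), so it seems to have something to do with how the arguments get paired up rather than with the pool itself:

    chunk_list = [['Ann'], ['Bob'], ['Cal']]   # stand-in for the 17 name chunks
    names2 = ['Ann', 'Ben']
    print(len(list(zip(chunk_list, (names2,)))))   # prints 1, not 3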

I am sorry that I cannot add the original data, as it is confidential.

Thanks for taking the time to read my question. I know that it is long and difficult compared to the average question posted on this site, so thanks for your time.

And, if by some miracle you manage to solve this, I salute you :)

    Note that multithreading CPU-bound Python code will not speed up execution, because of the GIL: https://stackoverflow.com/questions/1294382/what-is-the-global-interpreter-lock-gil-in-cpython – Jeremy Friesner Feb 28 '21 at 23:43
  • @JeremyFriesner Oh wow.... I got excited because I cut the running time to 2 hours... I guess that's because of the aforementioned problem... I know that this might be a subject for a different post, but is there any way to get around this? I mean, if it doesn't speed up the code, what's the point of multithreading? – Epic_Yarin_God Feb 28 '21 at 23:51
  • See the last paragraph of this answer: https://stackoverflow.com/questions/45157694/a-multi-threading-example-of-the-python-gil/45158303#45158303 – Jeremy Friesner Mar 01 '21 at 00:59
