I have a long list of users (about 200,000) and a corresponding DataFrame df
with their attributes. I'd like to write a for loop to measure the pairwise similarity of the users. The code is as follows:

    import pandas as pd

    rows = []  # collect matches in a list; DataFrame.append was removed in pandas 2.0
    for u1 in reversed(user_list):
        # compare u1 with every smaller user id, from u1 - 1 down to 1
        for u2 in reversed(range(1, u1)):
            sim = measure_sim(df[u1], df[u2])
            if sim >= 0.6:  # keep only sufficiently similar pairs
                rows.append((u1, u2, sim))
    df2record = pd.DataFrame(rows, columns=['u1', 'u2', 'sim'])

Now I want to run this loop with multiprocessing. I have read some tutorials, but I still don't know how to handle it properly. It seems that I should first set a reasonable number of processes, say 6,
and then feed each iteration of the outer loop to one process. But how can I know that the task in a given process has finished, so that a new iteration can begin? Could you help me with this? Thank you in advance!
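
To make the question concrete, here is a minimal sketch of the pattern I think I need, assuming `multiprocessing.Pool` is the right tool. As far as I understand, the pool itself hands the next task to whichever worker becomes free, so nothing has to be tracked by hand. The names `ATTRS`, `pairs_for_user` and `run` are made up for illustration, `ATTRS` is a tiny stand-in for my real `df`, and this `measure_sim` is only a placeholder for my real similarity function.

```python
import multiprocessing as mp

THRESHOLD = 0.6

# Hypothetical toy stand-in for df: a tuple of attributes per user id (1..5).
ATTRS = {u: (u % 3, u % 2, 1) for u in range(1, 6)}


def measure_sim(a, b):
    """Placeholder similarity: fraction of equal attributes (not the real one)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairs_for_user(u1):
    """One task = the whole inner loop for a single u1."""
    rows = []
    for u2 in reversed(range(1, u1)):
        sim = measure_sim(ATTRS[u1], ATTRS[u2])
        if sim >= THRESHOLD:
            rows.append((u1, u2, sim))
    return rows


def run(user_list, processes=6):
    """Spread the outer loop over a pool of worker processes."""
    with mp.Pool(processes=processes) as pool:
        # Each worker receives its next u1 as soon as it finishes the previous one;
        # results come back in completion order, not submission order.
        chunks = pool.imap_unordered(pairs_for_user, reversed(user_list))
        return [row for chunk in chunks for row in chunk]


if __name__ == '__main__':
    # The guard is needed so worker processes can import this module safely.
    for row in run(list(range(1, 6)), processes=2):
        print(row)
```

The list returned by `run` could then be passed to `pd.DataFrame(rows, columns=['u1', 'u2', 'sim'])` once, at the end. Is this the right approach, or is there a better way to split the work?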