I've looked at the documentation of the multiprocessing module and at the other questions asked here, and none seem similar to my case, hence this new question.
For simplicity, I have a piece of code of the form:
import pandas as pd
from collections import defaultdict

# simple dataframe of some users and their properties.
data = {'userId': [1, 2, 3, 4],
        'property': [12, 11, 13, 43]}
df = pd.DataFrame.from_dict(data)

# a function that generates permutations of the above users, in the form of a list of lists
# such as [[1,2,3,4], [2,1,3,4], [2,3,4,1], [2,4,1,3]]
user_perm = generate_permutations(nr_perm=4)

# a function that computes some relation between users
def comp_rel(df, permutation, user_dict):
    # isin needs a list-like, so wrap the single userId in a list
    df1 = df.userId.isin([permutation[0]])
    df2 = df.userId.isin([permutation[1]])
    user_dict[permutation[0]] += permutation[1]
    return user_dict

# and finally a loop:
user_dict = defaultdict(int)
for permutation in user_perm:
    user_dict = comp_rel(df, permutation, user_dict)
I know this code makes very little (if any) sense on its own; it is just a small example that mirrors the structure of the actual code I am working on. In the end, user_dict should map each userId to some value.
The actual code works fine and produces the correct dict, but it runs on a single thread, and it's painfully slow given that the other 15 threads of my CPU are sitting completely idle.
My question is: how can I use Python's multiprocessing module to change that last for loop so it runs on all available threads/cores? I looked at the documentation, but it's not very easy to understand.
EDIT: I am trying to use a Pool like this:
p = multiprocessing.Pool(multiprocessing.cpu_count())
p.map(comp_rel(df, permutation, user_dict), user_perm)
p.close()
p.join()
However, this breaks because the initial code relies on the line
user_dict = comp_rel(df, permutation, user_dict)
to carry user_dict from one iteration to the next, and I don't know how the dictionaries produced by the workers should be merged once the pool is done.
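From re-reading the Pool.map docs, here is roughly what I imagine the fix might look like — this is only my sketch, reusing df and user_perm from above. comp_rel_single is a hypothetical variant of comp_rel that returns its contribution instead of mutating a shared dict, and functools.partial freezes df so map only has to pass each permutation. I'm not sure whether merging the results afterwards like this is the right approach:

import multiprocessing
from functools import partial
from collections import defaultdict

# hypothetical variant of comp_rel: returns this permutation's contribution
# instead of mutating a shared dict (worker processes can't share one anyway)
def comp_rel_single(df, permutation):
    df1 = df.userId.isin([permutation[0]])
    df2 = df.userId.isin([permutation[1]])
    return permutation[0], permutation[1]

if __name__ == '__main__':
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        # partial freezes df, so map only supplies each permutation
        results = p.map(partial(comp_rel_single, df), user_perm)

    # merge the per-permutation results into one dict afterwards
    user_dict = defaultdict(int)
    for user, value in results:
        user_dict[user] += value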