I have a function that iterates over a list of tuples of integers and checks whether both integers in each tuple exist in another list of plain integers (no tuples). If the tuple is already a key in the dictionary, increment its value by one; otherwise, add the key and set its value to one:

def counts(double_int_list, single_int_list, dictionary_counts):

    for double_int in double_int_list:

        int1 = double_int[0]
        int2 = double_int[1]

        if int1 in single_int_list and int2 in single_int_list:

            if not dictionary_counts:                     # if dict is empty
                dictionary_counts[double_int] = 1
            elif double_int in dictionary_counts:         # if key already exists
                dictionary_counts[double_int] += 1
            else:                                         # if key does not exist
                dictionary_counts[double_int] = 1

My goal is to use the multiprocessing module to speed this up (there are at least 4 million tuples, and there will be more in the future). I will be calling this function multiple times, where double_int_list remains the same but single_int_list changes on every iteration.

I've tried looking this up and found only one relevant post, but it had no final answer. I have found answers to my issue of multiple arguments (I can't use map directly since I have one iterable argument and one constant argument) and to writing to the same dictionary (possibly multiprocessing.Queue(), Manager(), Lock(), and/or SyncManager()). But I'm very new to multiprocessing and I'm having a hard time putting everything together. It's possible that this type of function can't be split up or won't save much time, but any advice/suggestions/comments are appreciated! If more information is needed, let me know.
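
From my reading, functools.partial seems to be the standard way around the multiple-arguments limitation of Pool.map; here is a minimal sketch of what I'm imagining (all names are placeholders, and count_pairs is a hypothetical pure variant of counts that returns its own dictionary):

from functools import partial
from multiprocessing import Pool

def count_pairs(pairs, allowed):
    # hypothetical pure variant of counts() that returns its own dict
    result = {}
    for a, b in pairs:
        if a in allowed and b in allowed:
            result[(a, b)] = result.get((a, b), 0) + 1
    return result

if __name__ == "__main__":
    chunks = [[(1, 2), (1, 2)], [(3, 4)]]   # placeholder chunks of double_int_list
    allowed = {1, 2, 3}                     # placeholder single_int_list as a set
    worker = partial(count_pairs, allowed=allowed)  # freeze the constant argument
    with Pool(2) as pool:
        print(pool.map(worker, chunks))     # one partial count dict per chunk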

1 Answer

You might be able to get some speedups by using set lookups instead of list lookups, and by using a defaultdict for counting, like so:

from collections import defaultdict

def counts(double_int_list, single_int_set):
    countdict = defaultdict(int)
    for double_int in double_int_list:
        int1, int2 = double_int
        if int1 in single_int_set and int2 in single_int_set:
            countdict[double_int] += 1

    return countdict
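
For example (a hypothetical call; the point is that the reference list is converted to a set once per iteration, since set membership tests are O(1) while list membership tests are O(n)):

single_int_set = set(single_int_list)  # convert once, reuse for every lookup
countdict = counts(double_int_list, single_int_set)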

If you do want to do multiprocessing, I would split the list into chunks and parallelize over the chunks:

import multiprocessing
from functools import partial

pool = multiprocessing.Pool(4)  # 4 CPUs
# referenceset and chunkedlist need to be created beforehand.
# A lambda can't be pickled by multiprocessing, so bind the constant
# argument with functools.partial instead.
worker = partial(counts, single_int_set=referenceset)
results = pool.map(worker, chunkedlist)

# results is now a list of count dictionaries that need to be combined;
# for that job I would recommend merge_with(sum, ...) from the pytoolz library.
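
As a sketch of that combining step (assuming the results list from above; collections.Counter from the standard library is an alternative to pytoolz):

from collections import Counter

# Sum the per-chunk dictionaries into one overall count dictionary;
# Counter.update() adds counts for shared keys rather than replacing them.
combined = Counter()
for chunk_counts in results:
    combined.update(chunk_counts)

# Equivalent one-liner with pytoolz:
# from toolz import merge_with
# combined = merge_with(sum, *results)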
zach
  • Thank you for your quick reply, @zach. I had done something similar to your first suggestion, but I need to maintain the same dictionary as I iterate over the function, which is why I don't return anything within the function. Also, I need to keep the tuples as keys; the relationship between the two integers matters. – sarahbusby Feb 17 '16 at 20:26
  • Glad to be of help. I changed the structure with an eye toward multiprocessing: each worker can be independent, and only after the work is done do you combine the results. – zach Feb 17 '16 at 20:59