
I'm running code which uses 16 processes to build up 16 dictionaries of length approximately 62,500 (that's about 1,000,000 in total). After each process finishes I update a single dictionary like this:

main_dict.update(sub_dict)

I'm finding that my code seems to hang a lot near the end of the whole script (around when I'd expect some of my processes to start returning their sub_dicts). So I'm suspecting the dictionary update.

Supposedly update needs to check every key of the sub_dict against those of the main_dict so with my example that could mean up to 62500*937500 checks for the last update, right?

Am I on the right track here? And if so, is there a way to speed things up? I know the keys are going to be unique and there will never be overlap between the sub_dicts so maybe that helps.
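
For context, here is a minimal sketch of the pattern described above (the worker function, its contents, and the use of `multiprocessing.Pool` are illustrative assumptions, not the actual code):

```python
# Sketch only: 16 worker processes each build a sub-dict of ~62,500 entries
# with unique keys, and the parent process merges them with .update().
from multiprocessing import Pool

def build_sub_dict(worker_id):
    # Stand-in for the real work; each worker produces ~62,500 unique keys.
    return {f"{worker_id}-{i}": i for i in range(62_500)}

if __name__ == "__main__":
    main_dict = {}
    with Pool(processes=16) as pool:
        for sub_dict in pool.imap_unordered(build_sub_dict, range(16)):
            main_dict.update(sub_dict)  # the step in question
    print(len(main_dict))  # ~1,000,000
```
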

Alexander Soare
  • `.update` essentially just does `for k, v in sub_dict.items(): main_dict[k] = v`. There is no more computationally efficient way to accomplish that. I'm not sure what you mean by "Supposedly update needs to check every key of the sub_dict against those of the main_dict so with my example that could mean up to 62500*937500 checks for the last update, right?". To be clear, it iterates through the `sub_dict`, but it doesn't "check against the keys of `main_dict`"; I'm not really sure I understand you, though. – juanpa.arrivillaga Aug 12 '20 at 20:31
  • Nope, you're right. I was thinking of it the wrong way. I suppose I'm on the wrong track then, and need to keep digging around for the error – Alexander Soare Aug 12 '20 at 20:33
  • 1
    I don't think you are on the right track but we'd need more details to know for sure. Insertion into a hash table is O(1) so this operation is linear in the number of elements. If you're not already familiar with how this works, you can find your favorite tutorial to read more. For each item to insert, you basically compute `hash(item)` and insert the item at that index in the underlying hashtable. – Micah Smith Aug 12 '20 at 20:34
  • 2
    You have 16 subprocesses feeding back into the main process? That's likely more expensive than the dictionary update part. – tdelaney Aug 12 '20 at 20:34
  • 2
    I suspect you are just seeing the `multiprocessing` overhead. Note, multiprocessing *doesn't share state*, it creates muttiple, distinct python processes, which have to explicitly serialize any state, (using pickle), send it across processes, then deserialize it in the main process. – juanpa.arrivillaga Aug 12 '20 at 20:34
  • @tdelaney and @juanpa.arrivillaga Actually that sounds like it could be the culprit. The values of these dicts are long lists of words from documents, so they are pretty huge. Perhaps I'll open up another question. Anyone know what I should do with this one? Delete it? Answer it saying that my premise was wrong? – Alexander Soare Aug 12 '20 at 20:37
  • @AlexanderSoare You might want to check how equally you have divided up the work between your 16 processes. In your workers, print `time.time()` at the end and see if there are big differences. I am wondering whether most are idle while the last one (or few) completes - in which case, see if there are other ways to divide it up. – alani Aug 12 '20 at 20:38
  • 2
    I think @MicahSmith comment about hash insertion is O(1) answers the question of efficiency. – tdelaney Aug 12 '20 at 20:39
  • @alaniwi I've divided it equally. I split the full length up into 16 with floor division then use index offsetting in my input array. Plus they all have tqdm progress bars so I can check what's happening with each. – Alexander Soare Aug 12 '20 at 20:39
  • @MicahSmith please feel free to answer and I'll accept – Alexander Soare Aug 12 '20 at 20:40
  • multiprocessing unpickles the returned subprocess dict. It may be a little faster to return the subprocess dict.items() and use that in the update to the master dict. – tdelaney Aug 12 '20 at 20:40
  • @AlexanderSoare Okay, fair enough. It was just a possibility. I agree with others that the dictionary update doesn't sound like a particularly expensive thing. – alani Aug 12 '20 at 20:40
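
To illustrate the serialization point raised in the comments, here is a rough sketch (the sizes and payloads are assumptions chosen to resemble the question) comparing the cost of a pickle round-trip of a large dict of word lists with the cost of the `.update()` itself:

```python
# Rough comparison: how long does multiprocessing-style serialization of a
# large dict of word lists take versus merging it with .update()?
import pickle
import time

sub_dict = {i: ["word"] * 50 for i in range(62_500)}  # hypothetical payload

t0 = time.perf_counter()
restored = pickle.loads(pickle.dumps(sub_dict))  # what returning the dict costs
t1 = time.perf_counter()

main_dict = {}
t2 = time.perf_counter()
main_dict.update(restored)  # the merge itself
t3 = time.perf_counter()

print(f"pickle round-trip: {t1 - t0:.3f}s  update: {t3 - t2:.3f}s")
```

If the round-trip is much larger than the update time, that supports the suggestion that the hangs come from returning the sub-dicts rather than from `update`.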

1 Answer


The time complexity of Python's built-in operations is documented here.

As @MicahSmith already answered (in the comments), inserting a single key into a dict is O(1) on average and O(n) in the amortized worst case, so `dict.update` is linear in the size of the dict being merged in: it simply iterates over `sub_dict`'s keys and values and inserts each pair into `main_dict`.
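
A quick way to check this empirically (not part of the original answer; sizes are taken from the question) is to time `update` with sub-dicts of increasing size and confirm the cost grows with `len(sub_dict)`, not with `len(sub_dict) * len(main_dict)`:

```python
# Timing sketch: .update() should scale with the size of the dict being
# merged in, regardless of how large the destination dict already is.
import timeit

main_dict = {i: i for i in range(937_500)}

for n in (62_500, 125_000, 250_000):
    sub_dict = {-i: i for i in range(1, n + 1)}  # keys disjoint from main_dict
    # Repeated updates with the same sub_dict just overwrite the same keys,
    # which still costs one hash-and-insert per item.
    t = timeit.timeit(lambda: main_dict.update(sub_dict), number=10)
    print(f"len(sub_dict)={n}: {t / 10 * 1000:.1f} ms per update")
```
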

Cobalt