
I'm trying to update a shared object (a dict) using the following code. But it does not work. It gives me the input dict as an output.

Edit: Essentially, what I'm trying to achieve here is to append the positions of the items in data (a list) to the dict's lists. Each data item gives the key in the dict.

Expected output: {'2': [2], '1': [1, 4, 6], '3': [3, 5]}
Note: Approach 2 raises the error TypeError: 'int' object is not iterable

  1. Approach 1

    from multiprocessing import *
    def mapTo(d,tree):
            for idx, item in enumerate(list(d), start=1):
                tree[str(item)].append(idx)
    
    data=[1,2,3,1,3,1]
    manager = Manager()
    sharedtree= manager.dict({"1":[],"2":[],"3":[]})
    with Pool(processes=3) as pool:
        pool.starmap(mapTo, [(data,sharedtree ) for _ in range(3)])
    
  2. Approach 2

    from multiprocessing import *
    def mapTo(d):
        global tree
        for idx, item in enumerate(list(d), start=1):
            tree[str(item)].append(idx)

    def initializer():
        global tree
        tree = dict({"1":[],"2":[],"3":[]})

    data=[1,2,3,1,3,1]
    with Pool(processes=3, initializer=initializer, initargs=()) as pool:
        pool.map(mapTo,data)
juanpa.arrivillaga
CKM
  • 2
    Instead of sharing a dict between processes which is a bad idea, return a dict from each process and merge them afterwards. – Joshua Nixon Apr 30 '20 at 08:25
  • Why is sharing a dict a bad idea? In my case the dict, which is a kind of hash table, is really huge, and I don't think returning a dict makes sense. – CKM Apr 30 '20 at 08:27
  • Also, all processes are supposed to append items in the dict's list. I'm not worried about race condition here since a `Manager`'s list can be updated independently by subprocesses. – CKM Apr 30 '20 at 08:29
  • Sharing data structures between separate processes is kind of tricky. It can certainly be done, but to @JoshuaNixon point, make sure there isn't an easier way to accomplish the task at hand. – Z4-tier Apr 30 '20 at 08:30
  • Approach 2 raises error because it calls `mapTo` and in each call it passes an individual element of list – Raj Apr 30 '20 at 08:31
  • @Z4-tier. Using `Manager` we can easily share data structures. Why is it tricky to be specific? – CKM Apr 30 '20 at 08:32
  • @Raj. Will it work if I make `chunksize=3` or more? – CKM Apr 30 '20 at 08:33
  • @juanpa.arrivillaga Will moving `tree =...` within initializer work? – CKM Apr 30 '20 at 08:35
  • Also don't do 'from multiprocessing import *' (i assume you just typed up a quick example in an interpreter shell, but still...) – Z4-tier Apr 30 '20 at 08:35
  • you mean approach 2 is better but still some issue is there. Is there any possibility to make this work using `Manager` somehow? I don't know. – CKM Apr 30 '20 at 08:38
  • @juanpa.arrivillaga I moved `tree=` within initializer as also suggested [here](https://stackoverflow.com/questions/18778187/multiprocessing-pool-with-a-global-variable/18779028?noredirect=1#comment108820250_18779028) but same error. – CKM Apr 30 '20 at 08:43
  • The problem is that you are mutating the lists inside your managed dict, but that isn't going to work unless they are managed lists. – juanpa.arrivillaga Apr 30 '20 at 08:54
  • @juanpa.arrivillaga approach 1 with your suggested modification `sharedtree= manager.dict({"1":manager.list(),"2":manager.list(),"3":manager.list()})` not yielding expected o/p. – CKM Apr 30 '20 at 08:58
  • @chandresh what is your expected output? What output are you seeing? – juanpa.arrivillaga Apr 30 '20 at 08:58
  • I'm not using `init==main` thing. My o/p is same as i/p. – CKM Apr 30 '20 at 08:59
  • @chandresh see my answer. Try it on your machine. Please show your output, edit your question and add it. – juanpa.arrivillaga Apr 30 '20 at 09:01
  • @chandresh What is the use of `[(data,sharedtree ) for _ in range(3)]` inside `starmap` in approach 1? – Raj Apr 30 '20 at 09:05
  • @Raj you are free to modify code to make it work. I thought it will pass shared tree to 3 processes but it will not. – CKM Apr 30 '20 at 09:07
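
The suggestion in the first comment (return a dict from each worker and merge the results in the parent) could look roughly like the sketch below. It is not code from the thread: the helper names `index_chunk` and `merge`, and the chunk size of two pairs per task, are assumptions made for illustration. Whether this is practical for a very large dict depends on how much data has to travel back to the parent.

```python
from collections import defaultdict
from multiprocessing import Pool


def index_chunk(pairs):
    """Build a private dict for one chunk of (position, item) pairs."""
    local = defaultdict(list)
    for idx, item in pairs:
        local[str(item)].append(idx)
    return dict(local)


def merge(partials):
    """Combine the per-worker dicts; lists are sorted to make the result deterministic."""
    merged = defaultdict(list)
    for local in partials:
        for key, positions in local.items():
            merged[key].extend(positions)
    return {key: sorted(positions) for key, positions in merged.items()}


if __name__ == "__main__":
    data = [1, 2, 3, 1, 3, 1]
    pairs = list(enumerate(data, start=1))
    # Hypothetical chunking: two (position, item) pairs per task.
    chunks = [pairs[i:i + 2] for i in range(0, len(pairs), 2)]
    with Pool(processes=3) as pool:
        partials = pool.map(index_chunk, chunks)
    print(merge(partials))
```

No state is shared here: each worker only touches its own dict, and the parent does the merging after `pool.map` returns.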

1 Answer


You need to use managed lists if you want the changes to be reflected. So, the following works for me:

from multiprocessing import *
def mapTo(d,tree):
        for idx, item in enumerate(list(d), start=1):
            tree[str(item)].append(idx)

if __name__ == '__main__':
    data=[1,2,3,1,3,1]

    with Pool(processes=3) as pool:
        manager = Manager()
        sharedtree= manager.dict({"1":manager.list(), "2":manager.list(),"3":manager.list()})
        pool.starmap(mapTo, [(data,sharedtree ) for _ in range(3)])

    print({k:list(v) for k,v in sharedtree.items()})

This is the output:

{'1': [1, 1, 1, 4, 4, 4, 6, 6, 6], '2': [2, 2, 2], '3': [3, 3, 5, 3, 5, 5]}

Note: you should always use the `if __name__ == '__main__':` guard when using multiprocessing. Also, avoid starred imports.

Edit

On Python < 3.6, an in-place change to a list nested in the managed dict is not propagated, so you have to re-assign the list to its key after appending. Use this for mapTo:

def mapTo(d,tree):
        for idx, item in enumerate(list(d), start=1):
            l = tree[str(item)]
            l.append(idx)
            tree[str(item)] = l
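
A minimal single-process sketch of why the re-assignment matters, assuming a plain (non-managed) list stored inside a `manager.dict`, as in the original question. The function name `nested_list_demo` is made up for this illustration.

```python
from multiprocessing import Manager


def nested_list_demo():
    """Return the nested list after an in-place append and after a re-assignment."""
    with Manager() as manager:
        tree = manager.dict({"1": []})

        # tree["1"] hands back a copy of the plain list, so this append is lost.
        tree["1"].append(1)
        after_in_place = list(tree["1"])

        # Fetch, mutate, store back: the assignment goes through the dict proxy.
        positions = tree["1"]
        positions.append(1)
        tree["1"] = positions
        after_reassign = list(tree["1"])

    return after_in_place, after_reassign


if __name__ == "__main__":
    print(nested_list_demo())  # ([], [1])
```

The first append changes only a local copy of the list. The second version works because assigning to `tree["1"]` is an operation on the managed dict itself.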

Finally, you aren't using `starmap`/`map` correctly: you are passing the whole data list three times, so everything gets counted three times. A mapping operation should work on each individual element of the data you are mapping over, so you want something like:

from functools import partial
from multiprocessing import *
def mapTo(i_d,tree):
    idx,item = i_d
    l = tree[str(item)]
    l.append(idx)
    tree[str(item)] = l

if __name__ == '__main__':
    data=[1,2,3,1,3,1]

    with Pool(processes=3) as pool:
        manager = Manager()
        sharedtree= manager.dict({"1":manager.list(), "2":manager.list(),"3":manager.list()})
        pool.map(partial(mapTo, tree=sharedtree), list(enumerate(data, start=1)))

    print({k:list(v) for k,v in sharedtree.items()})
juanpa.arrivillaga
  • o/p from your code is `{'2': [], '1': [], '3': []}`. – CKM Apr 30 '20 at 09:01
  • @chandresh I cannot reproduce. I'm getting the output above. Are you sure you are running this *exact* code? – juanpa.arrivillaga Apr 30 '20 at 09:02
  • yes. I copy pasted your code and run in spyder with py 3.5. Besides, your o/p is not my expected o/p as shown in the question at the top. – CKM Apr 30 '20 at 09:03
  • @chandresh yes, that is because you are using `pool.map` incorrectly. But that's sort of irrelevant to your issue. The problem is that you cannot easily nest managed objects in Python 3.5; you have to do a seemingly redundant re-assignment. You really should probably upgrade... – juanpa.arrivillaga Apr 30 '20 at 09:07
  • hmm. seems like assignment to dict's list in py 3.5 is 3 line code. – CKM Apr 30 '20 at 09:13
  • @chandresh yes, again, you should really upgrade, but it fixes your problem. Now, *another* issue is how you are using `map`, you are passing the data three times, so of course, everything gets counted three times. – juanpa.arrivillaga Apr 30 '20 at 09:15
  • Got it. I made two changes: 1. the 3-line assignment for py 3.5 and 2. `pool.starmap(mapTo,[(data, sharedtree)])`. It gives the expected output. Thanks a lot. You saved 15.5 hours of my quarantine hard work. – CKM Apr 30 '20 at 09:21
  • @chandresh don't use `pool.starmap(mapTo,[(data, sharedtree)])` that defeats the entire purpose of multiprocessing, your data only has one element, so it will only ever use 1 process in your pool, you might as well just forget multiprocessing. Look at my final edit – juanpa.arrivillaga Apr 30 '20 at 09:23
  • sure. one small comment: Even this line works `sharedtree = manager.dict({"1":[],"2":[],"3":[]})`. I mean `Manager` need not manage nested list. – CKM Apr 30 '20 at 09:28
  • @chandresh yes, that's true. I haven't really used a manager pre-3.6 – juanpa.arrivillaga Apr 30 '20 at 09:29
  • Hi, two questions. 1. Why does iterating over data inside `mapTo` not work? Doesn't the pool assign a block of data to each process so that I can iterate over it? 2. How would I know that the one-line append to the dict's list is not going to work in py 3.5? What is the reason it works in py 3.6+? – CKM Apr 30 '20 at 09:55
  • One more Q. Replacing `map` with `imap` does not work. Why so? – CKM Apr 30 '20 at 11:16
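
On the question of why iterating over the data inside `mapTo` fails with `pool.map`, and whether `chunksize` changes that: `pool.map` calls the function once per element of the iterable, and `chunksize` only controls how those calls are batched when they are sent to the workers. A small sketch, with the helper name `describe` chosen here for illustration:

```python
from multiprocessing import Pool


def describe(arg):
    """Report the type of the single argument a worker call receives."""
    return type(arg).__name__


if __name__ == "__main__":
    data = [1, 2, 3, 1, 3, 1]
    with Pool(processes=3) as pool:
        # chunksize changes how tasks are batched for transport to the
        # workers, not what each call of the function receives.
        print(pool.map(describe, data, chunksize=3))
        # Passing the whole list as one work item gives the function a list,
        # but then only a single task exists for the pool to run.
        print(pool.map(describe, [data]))
```

This is why Approach 2 in the question raises `TypeError: 'int' object is not iterable`: each call receives one `int`, and `list(d)` is then applied to it.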