Python process dictionary using multiprocessing [Python 3.7]

Question

I am new to multiprocessing and need some help understanding how I can convert my current code to use multiprocessing so it can process data faster. I have the following data

accounts = [{'Id': '123',  'Email': 'Test_01@gmail.com', 'Status': 'ACTIVE'},
            {'Id': '124',  'Email': 'Test_02@gmail.com', 'Status': 'ACTIVE'},
            {'Id': '125',  'Email': 'Test_03@gmail.com', 'Status': 'ACTIVE'}]

which I need to process. Currently I am using a for loop to process it. This works perfectly fine but takes a long time, which is what I would like to optimize. The code looks as follows:

dl_users = {}
group_users = {}
for a in accounts:
    if a['Status'] == 'ACTIVE':
        dl_users[a['Email']] = get_dl_users(a['Email'])
        group_users[a['Email']] = get_group_users(a['Id'])

print(dl_users)
print(group_users)

Instead of using a for loop, I would like to populate dl_users and group_users in parallel so that large amounts of data can be processed quickly. I saw a couple of examples and tried using the concurrent library, but due to my lack of knowledge of multiprocessing I have been struggling. Any help or guidance would be greatly appreciated.

score 0 · Answer 1 · answered Apr 07 '21 at 01:45

Multiprocessing spawns multiple Python processes to act as workers. Because of this, code in one process cannot directly access or modify variables in another process. Each worker has its own memory, isolated from the others. There are three ways you could get around this:

You can use a multiprocessing Pipe or Queue to pass data from your worker processes back to your main process. You can't directly add to the dictionary, but you can pass individual entries back (they are pickled in the worker and unpickled on the receiving side) and let the main process store them in the dictionary. Or, you could build up a separate dictionary in each process and send them all back to merge at the end.
You could use threading instead of multiprocessing. Threading is a lot like multiprocessing except it runs its workers in separate threads within the same Python interpreter instead of separate interpreters. This means that you can access global shared variables. For CPU-bound work, threading usually gives little speedup because the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time. But, in this case, it looks like the threads will be spending most of their time waiting on get_dl_users and get_group_users (which I assume are network or database operations), so you could get a lot of benefit out of multithreading.
If you are mostly waiting on IO operations, you probably don't need threads at all. You can just use Python's asyncio. This lets you start IO operations and keep running the rest of your code while they are in flight. In particular, you could use asyncio.wait to run all your IO operations concurrently, provided get_dl_users and get_group_users are (or can be rewritten as) coroutines.
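
To illustrate the asyncio option, here is a minimal sketch built around asyncio.wait. It assumes get_dl_users and get_group_users can be written as coroutines. The versions below are hypothetical stand-ins that only sleep, since the real functions are not shown in the question. The helper names fetch_account and fetch_all_async are also made up for this example.

```python
import asyncio

accounts = [{'Id': '123', 'Email': 'Test_01@gmail.com', 'Status': 'ACTIVE'},
            {'Id': '124', 'Email': 'Test_02@gmail.com', 'Status': 'ACTIVE'},
            {'Id': '125', 'Email': 'Test_03@gmail.com', 'Status': 'ACTIVE'}]


async def get_dl_users(email):
    # Hypothetical stand-in: the real coroutine would await a network call.
    await asyncio.sleep(0.1)
    return ['dl-member-of-' + email]


async def get_group_users(account_id):
    # Hypothetical stand-in: the real coroutine would await a network call.
    await asyncio.sleep(0.1)
    return ['group-member-of-' + account_id]


async def fetch_account(account):
    # Run both lookups for a single account concurrently.
    dl, group = await asyncio.gather(
        get_dl_users(account['Email']),
        get_group_users(account['Id']),
    )
    return account['Email'], dl, group


async def fetch_all_async(accounts):
    dl_users, group_users = {}, {}
    # Wrap each coroutine in a Task; newer Python versions require this
    # before passing them to asyncio.wait.
    tasks = [asyncio.ensure_future(fetch_account(a))
             for a in accounts if a['Status'] == 'ACTIVE']
    if not tasks:
        # asyncio.wait raises ValueError when given an empty collection.
        return dl_users, group_users
    done, _pending = await asyncio.wait(tasks)
    for task in done:
        # result() re-raises any exception that occurred in the task.
        email, dl, group = task.result()
        dl_users[email] = dl
        group_users[email] = group
    return dl_users, group_users


if __name__ == '__main__':
    dl_users, group_users = asyncio.run(fetch_all_async(accounts))
    print(dl_users)
    print(group_users)
```

If the real functions are blocking and cannot be rewritten as coroutines, this approach does not help on its own, and the threading option is likely the simpler fit.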

Thanks for the detailed response, yes you are correct `get_dl_users` and `get_group_users` are network operations. Based on your feedback I believe multithreading would be the right way to go — Ajay Misra, Apr 07 '21 at 02:27
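
For the multithreading route mentioned in the comment above, a minimal sketch using concurrent.futures.ThreadPoolExecutor might look like the following. The get_dl_users and get_group_users bodies are hypothetical stand-ins that sleep to imitate a blocking network call. The name fetch_all_threaded and the max_workers value are assumptions for illustration, not something from the question.

```python
import time
from concurrent.futures import ThreadPoolExecutor

accounts = [{'Id': '123', 'Email': 'Test_01@gmail.com', 'Status': 'ACTIVE'},
            {'Id': '124', 'Email': 'Test_02@gmail.com', 'Status': 'ACTIVE'},
            {'Id': '125', 'Email': 'Test_03@gmail.com', 'Status': 'ACTIVE'}]


def get_dl_users(email):
    # Hypothetical stand-in for the real blocking network call.
    time.sleep(0.1)
    return ['dl-member-of-' + email]


def get_group_users(account_id):
    # Hypothetical stand-in for the real blocking network call.
    time.sleep(0.1)
    return ['group-member-of-' + account_id]


def fetch_all_threaded(accounts, max_workers=8):
    active = [a for a in accounts if a['Status'] == 'ACTIVE']
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every lookup first so they all run concurrently.
        dl_futures = {a['Email']: pool.submit(get_dl_users, a['Email'])
                      for a in active}
        group_futures = {a['Email']: pool.submit(get_group_users, a['Id'])
                         for a in active}
        # Collect results in the main thread; result() blocks until the
        # call finishes and re-raises any exception from the worker.
        dl_users = {email: f.result() for email, f in dl_futures.items()}
        group_users = {email: f.result() for email, f in group_futures.items()}
    return dl_users, group_users


if __name__ == '__main__':
    dl_users, group_users = fetch_all_threaded(accounts)
    print(dl_users)
    print(group_users)
```

Only the main thread writes to the two dictionaries here, so no locking is needed. A sensible max_workers depends on how many simultaneous requests the remote service tolerates.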
