
I am new to multiprocessing, and I would really appreciate it if someone could guide me here. I have the following for loop, which gets some data from two functions. The code looks like this:

    dl_users = {}
    group_users = {}
    for a in accounts:
        dl_users[a['Email']] = get_dl_users(a['Email'], adConn)
        group_users[a['Email']] = get_group_users(a['Id'], adConn)

    print(f"Users part of DL - {dl_users}")
    print(f"Users part of groups - {group_users}")
    adConn.unbind()

This works fine and gets all the results, but recently I have noticed that it takes a long time to build the lists of users, i.e. dl_users and group_users; it takes almost 14-15 minutes to complete. I am looking for ways to speed this up and would like to convert the for loop to multiprocessing. get_group_users and get_dl_users make LDAP calls, so I am not 100% sure whether I should be converting this to multiprocessing or multithreading. Any suggestion would be a big help.

Ajay Misra
  • Keep in mind that multiprocessing and multithreading are not the same thing. If you need to run a lot of I/O-bound tasks, such as network connections, then use threading. For CPU-intensive tasks, use multiprocessing. – Xiddoc May 28 '21 at 08:31
  • Thanks for the response; yes, I felt multiprocessing would be the right approach here. Thanks for the explanation. – Ajay Misra May 28 '21 at 08:54
  • @Xiddoc What if I am using the requests module to fetch data from a server and saving it to the local disk; in this case, multiprocessing or multithreading? – Aaditya Ura Jun 22 '22 at 04:09
  • @AadityaUra Multithreading. Multiprocessing carries a lot of overhead and is usually only worth it for tasks that involve a lot of computing power (calculating math/hashes/bitcoins). For I/O tasks which you don't want to block the main thread (such as requests, or saving to and reading from files), you should use multithreading. – Xiddoc Jun 22 '22 at 22:29
  • @Xiddoc But this answer says you have to use multiprocessing? `https://stackoverflow.com/a/15143994/5904928` – Aaditya Ura Jun 29 '22 at 10:16
  • @AadityaUra They mention it is better to use multithreading. Quote: "Finally, what if your code is IO bound? Then threads are just as good as processes, and with less overhead (and fewer limitations, but those limitations usually won't affect you in cases like this). Sometimes that 'less overhead' is enough to mean you don't need batching with threads, but you do with processes, which is a nice win. So, how do you use threads instead of processes? Just change `ProcessPoolExecutor` to `ThreadPoolExecutor`." – Xiddoc Jun 30 '22 at 08:52

1 Answer


As mentioned in the comments, multithreading is appropriate for I/O-bound operations (reading/writing files, sending HTTP requests, communicating with databases), while multiprocessing is appropriate for CPU-bound tasks (transforming data, making calculations, and so on). Depending on which kind of operation your functions perform, you want one or the other. If they do a mix, separate the two kinds of work internally and profile which of them really needs optimisation, since both multiprocessing and multithreading introduce overhead that might not be worth adding.
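For a quick first check, you can simply time one call of each function before adding any concurrency; here is a minimal sketch, reusing the names from your question:

    import time

    # Time a single call of each function to see where the
    # 14-15 minutes actually go before adding any concurrency.
    start = time.perf_counter()
    get_dl_users(accounts[0]["Email"], adConn)
    print(f"one get_dl_users call: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    get_group_users(accounts[0]["Id"], adConn)
    print(f"one get_group_users call: {time.perf_counter() - start:.2f}s")

Since LDAP queries spend almost all of their time waiting on the network, your workload is almost certainly I/O-bound, which points towards multithreading.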

That said, the way to apply multiprocessing or multithreading is pretty simple in recent Python versions (including your 3.8).

Multiprocessing

    from multiprocessing import Pool


    # Pick the number of processes that works best for you
    processes = 4

    with Pool(processes) as pool:
        processed = pool.map(your_func, your_data)

Here, your_func is a function to apply to each element of your_data, which is an iterable. Note that on platforms that spawn rather than fork worker processes (Windows, and macOS by default), this code needs to run under an `if __name__ == "__main__":` guard. If you need to provide some other parameters to the callable, use functools.partial; a plain lambda will not work with multiprocessing.Pool, because the callable must be picklable and lambdas are not:

    from functools import partial

    processed = pool.map(partial(your_func, some_kwarg="some value"), your_data)
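If each call instead needs several positional arguments, the standard library also provides Pool.starmap, which unpacks one tuple of arguments per call; a minimal sketch:

    # Each tuple becomes one call: your_func(item, "some value")
    pairs = [(item, "some value") for item in your_data]

    with Pool(processes) as pool:
        processed = pool.starmap(your_func, pairs)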

Multithreading

The API for multithreading is very similar:

    from concurrent.futures import ThreadPoolExecutor


    # Pick the number of workers that works best for you.
    # For I/O-bound work this can be higher than your machine's
    # CPU count, since the threads spend most of their time waiting.
    workers = 4

    with ThreadPoolExecutor(workers) as pool:
        processed = pool.map(your_func, your_data)

If you need some attribute of the items rather than the items themselves, you can pass a generator expression instead of building an intermediate list:

    processed = pool.map(your_func, (account["Email"] for account in accounts))

Note that Executor.map returns a lazy iterator, not a list; iterate over it (or wrap it in list()) to actually retrieve the results.
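Applied to the loop in your question, a minimal sketch could look like the one below. It assumes that your adConn connection can safely be shared across threads; if your LDAP library is not thread-safe, open one connection per worker instead. Since these are threads rather than processes, nothing has to be pickled, so plain lambdas are fine here:

    from concurrent.futures import ThreadPoolExecutor

    workers = 4

    emails = [a["Email"] for a in accounts]
    ids = [a["Id"] for a in accounts]

    with ThreadPoolExecutor(workers) as pool:
        # map() submits all tasks up front and yields results in input
        # order, so both batches run concurrently and we can zip the
        # emails back onto the results to rebuild the dictionaries.
        dl_iter = pool.map(lambda email: get_dl_users(email, adConn), emails)
        group_iter = pool.map(lambda acc_id: get_group_users(acc_id, adConn), ids)
        dl_users = dict(zip(emails, dl_iter))
        group_users = dict(zip(emails, group_iter))

    print(f"Users part of DL - {dl_users}")
    print(f"Users part of groups - {group_users}")
    adConn.unbind()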
theberzi
  • Thank you so much for your help. I tried multithreading with `dl_users = pool.map(get_dl_users, [(account["Email"] for account in accounts), adConn])`. How do I view the results? When I try printing `dl_users` it just returns `<generator object Executor.map.<locals>.result_iterator at 0x11382a3d0>`. How can I wait for the results and then proceed further? Also, can you confirm if this is how `adConn` is to be passed as a parameter? – Ajay Misra May 29 '21 at 03:35
  • This way you're passing the whole `[(account["Email"] ...), adConn]` as the second parameter of map(). That parameter should be an iterable, but you're giving it a list *of iterables*. Check my last code example for how to pass the second argument correctly; to pass `adConn` along, see the `functools.partial` example under "Multiprocessing" (with threads, a lambda that closes over `adConn` also works). – theberzi May 29 '21 at 06:57