
Background

I have some code that looks like this right now:

from typing import Set

# Sequentially send results for each player and record the IDs that failed.
failed_player_ids: Set[str] = set()
for player_id in player_ids:
    success = player_api.send_results(
        player_id, user=user, send_health_results=True
    )
    if not success:
        failed_player_ids.add(player_id)

This code works well, but the problem is that it takes 5 seconds per call. There is a rate limit of 2000 calls per minute, and at 5 seconds per call the sequential loop only makes about 12 calls per minute, so I am way under the maximum capacity. I want to parallelize this to speed things up. This is my first time using the multiprocessing library in Python, and hence I am a little confused as to how I should proceed. I can describe what I want to do in words.

In my current code I loop through a list of player IDs; if the API response is a success I do nothing, and if it fails I make a note of that player ID.

I am not sure how to implement a parallel version of this code. I have some idea, but I am a little confused.

This is what I have thought of so far:

from multiprocessing import Pool
from typing import List, Optional

num_processors_to_use = 5  # This number can be increased to get more speed

def send_player_result(player_id_list: List[str]) -> Optional[str]:
    for player_id in player_id_list:
        success = player_api.send_results(player_id, user=user, send_health_results=True)
        if not success:
            return player_id

# Caller
with Pool(processes=num_processors_to_use) as pool:
    responses = pool.map(
        func=send_player_result,
        iterable=player_id_list,
    )
    failed_player_ids = set(responses)

Any comments and suggestions would help.

Unknowntiou
  • Is this useful? https://stackoverflow.com/a/28463266/3216427 – joanis May 04 '21 at 20:03
  • See also https://stackoverflow.com/q/3033952/3216427 – joanis May 04 '21 at 20:04
  • @joanis Thank you, the first post is a great find. I would also highly appreciate it, if it is not too difficult / time-consuming for you, if you could explain / correct my posted answer above along the lines of your comment. I feel it may help me understand better. – Unknowntiou May 04 '21 at 20:10
  • What does what you wrote actually do? Is it working yet or not? – joanis May 04 '21 at 20:16
  • PS: I've never done multiprocessing stuff in Python myself, I just recognized it from having reviewed questions about it recently. Hopefully someone else here will be able to comment on your code, if you indicate in what way it's not working yet. – joanis May 04 '21 at 20:19
  • PPS: I think you made a mistake in cutting and pasting the code, because you have two lines that start with `def send_player_result(` and I'm guessing the first one should not be there. – joanis May 04 '21 at 20:20
  • @joanis Good catch, I removed it. – Unknowntiou May 04 '21 at 20:21

1 Answer


If you use the map function, then each item of the iterable player_id_list will be passed as a separate task to the function send_player_result. Consequently, this function should no longer expect to be passed a list of player IDs, but rather a single player ID. And, as you know by now, if your tasks are largely I/O bound, then multithreading is a better model. You can use either:

from multiprocessing.dummy import Pool
# or
from multiprocessing.pool import ThreadPool

You will probably want to greatly increase the number of threads (but to no more than the size of player_id_list):

#from multiprocessing import Pool
from multiprocessing.dummy import Pool
from typing import Set

def send_player_result(player_id):
    success = player_api.send_results(player_id, user=user, send_health_results=True)
    return success

# Only required for Windows if you are doing multiprocessing:
if __name__ == '__main__':
    
    pool_size = 5  # This number can be increased to get more concurrency
    
    # Caller
    failed_player_ids: Set[str] = set()
    with Pool(pool_size) as pool:
        results = pool.map(func=send_player_result, iterable=player_id_list)
        for idx, success in enumerate(results):
            if not success:
                # failed for argument player_id_list[idx]:
                failed_player_ids.add(player_id_list[idx])
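
One way to pick the pool size in line with the note above (just a sketch, not part of the original answer; the ceiling of 50 is an arbitrary assumption, and player_id_list is assumed to be defined as in the question):

# Never start more threads than there are player IDs; 50 is an arbitrary upper
# bound chosen to stay comfortably under the 2000-calls-per-minute rate limit.
pool_size = min(len(player_id_list), 50)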
            
Booboo
  • Thank you for such a wonderful explanation. I had a few follow-up questions. What is the difference between `from multiprocessing.dummy import Pool` and doing `from multiprocessing import Pool`? Lastly, if I decide to use multithreading, would I use the latter option you provided, which is `from multiprocessing.pool import ThreadPool`? – Unknowntiou May 04 '21 at 20:44
  • I found the answer to my first question in this [post](https://stackoverflow.com/questions/26432411/multiprocessing-dummy-in-python-is-not-utilising-100-cpu#:~:text=threads%2C%20not%20processes%3A-,multiprocessing.,from%20fully%20utilizing%20your%20CPUs.) – Unknowntiou May 04 '21 at 20:51
  • Would I not be limited and not achieve any parallelism by using .dummy? Also, my original solution, where I loop through each player ID and make the API call, takes 5 seconds per call. Is it safe to assume that now I make 5 calls per 5 seconds, so I am essentially making 60 calls per minute? – Unknowntiou May 04 '21 at 20:55
  • I wish I could upvote your answer as well; unfortunately my reputation is only 13 points, so I am not able to do that. – Unknowntiou May 04 '21 at 20:56
  • `from multiprocessing.dummy import Pool` is equivalent to `from multiprocessing.pool import ThreadPool` and uses multithreading rather than multiprocessing (the underlying class will actually be `multiprocessing.pool.ThreadPool` in the multithreading case and `multiprocessing.pool.Pool` in the multiprocessing case). If your tasks are mostly doing I/O (such as posting to a URL), your tasks are in a wait state most of the time, and the known problem with multithreading, i.e. contention for the GIL, is not much of a problem. Threads are much cheaper to create and easier to work with (see the short sketch after these comments). (more) – Booboo May 04 '21 at 21:06
  • If your worker function is CPU-intensive, then you want `from multiprocessing import Pool` or `from multiprocessing.pool import Pool` (you end up with the same result). I can't make any predictions on what your throughput will be. I don't even have any idea what your worker function is doing or whether it is really I/O- or CPU-intensive. – Booboo May 04 '21 at 21:09
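
To make the equivalence mentioned in the comments above concrete, here is a minimal sketch (not from the original thread; fake_io_task and its 0.1-second sleep are stand-ins for an I/O-bound call such as player_api.send_results):

from multiprocessing.dummy import Pool as DummyPool
from multiprocessing.pool import ThreadPool
import time

def fake_io_task(i):
    time.sleep(0.1)  # stand-in for a blocking network call; the GIL is released while waiting
    return i

if __name__ == '__main__':
    # Both pools below are thread-based: multiprocessing.dummy.Pool simply
    # constructs a multiprocessing.pool.ThreadPool under the hood.
    with DummyPool(4) as pool:
        print(pool.map(fake_io_task, range(8)))
    with ThreadPool(4) as pool:
        print(pool.map(fake_io_task, range(8)))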