
I am currently encountering the following problem. I have a method_A() that loops over a given set A of strings. On each of these strings I have to execute another method_B() that returns a set B* of strings. All the returned sets B* should then be merged into a new set called results, since the sets B* can contain duplicates of certain strings.

I now want to make my method_A() faster by using multiprocessing instead of a loop, so that method_B() is executed for all strings of set A at the same time.

Here is an example of what my code currently looks like:

# Method A that takes in a set of strings and returns the merged set of all sets B*
def method_A(set_A):
    # Initialize empty set to store results
    results = set()
    
    # Loop over each string in set A
    for string in set_A:

        # Execute method B
        set_B = method_B(string)
        
        # Merge set B into results set
        results = results.union(set_B)
    
    # Return the final results set
    return results

# Method B that takes in a string and returns a set of strings
def method_B(string):
    # Perform some operations on the string to generate a set of strings
    set_B = {string.lower(), string.upper()}  # placeholder for the real string operations
    
    # Return the generated set
    return set_B

I have never used multiprocessing, but by googling my problem I found it suggested as a way to make my script faster. I tried to implement it myself with the help of ChatGPT, but I keep running into the problem that my resulting set is either empty or the multiprocessing isn't working at all. Maybe multithreading suits this case better, but I'm not sure. In general, I want to make method_A() faster, and I'm open to any solution that achieves that!

I'd be glad if you can help!

SO_is_love

2 Answers


You can replace your for loop with something like this:

Add `import concurrent.futures` at the top of your script, then:

    with concurrent.futures.ProcessPoolExecutor() as executor:
        for set_B in executor.map(method_B, set_A):
            results = results.union(set_B)

This will create a pool of subprocesses, each running its own Python interpreter.

executor.map(method_B, set_A) means: for every element in set_A, execute method_B

method_B will be executed in a subprocess and several calls to method_B will be executed in parallel.

Passing values to the subprocesses and getting the return values back is transparently handled by the executor.
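
For reference, here is the whole thing put together as a runnable sketch (method_B below is just a stand-in I made up for the real logic). Note the `if __name__ == "__main__":` guard: on platforms that start child processes with the spawn method (Windows, and macOS by default), it is required, because each child re-imports your module.

import concurrent.futures

def method_B(string):
    # Placeholder: derive a set of strings from the input string
    return {string.lower(), string.upper()}

def method_A(set_A):
    results = set()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # One method_B call per element; results come back in input order
        for set_B in executor.map(method_B, set_A):
            results = results.union(set_B)
    return results

if __name__ == "__main__":
    # The guard prevents spawned child processes from re-running this block
    print(method_A({"Foo", "Bar"}))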

More details can be found in Python's documentation: concurrent.futures

MSpiller
  • Definitely a huge improvement! One question: what would be the difference between `ThreadPoolExecutor` and `ProcessPoolExecutor`? Maybe you could give me an example of when you would use which one! :-) – SO_is_love Dec 12 '22 at 13:14
  • @SO_is_love That would be a separate question (which has already been answered here: https://stackoverflow.com/questions/51828790/what-is-the-difference-between-processpoolexecutor-and-threadpoolexecutor). Very, very high-level: a Python process can only execute one piece of Python code at a time, but it can wait for something _from outside_ (e.g. data from a webserver, reading from disk, ...) at the same time. When you want to execute "code" in parallel (CPU-bound tasks), go for multiple processes; when the amount of input/output is the bottleneck (I/O-bound tasks), go for multiple threads. A one-line illustration of the swap follows this comment thread. – MSpiller Dec 12 '22 at 13:38
  • Can it happen that the for loop never terminates? When I tried everything on my desktop at home, it worked perfectly, but when I ran it on my server it got stuck in an infinite loop. I found this article [link](https://alexwlchan.net/2019/10/adventures-with-concurrent-futures/) which briefly covers my topic, but I'm unsure about it. – SO_is_love Dec 12 '22 at 16:38
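
To illustrate the swap mentioned in the comments above: both executor classes implement the same concurrent.futures interface, so the thread-based variant only changes the class name. A sketch, assuming the same method_B, set_A, and results as in the answer (whether it actually helps depends on whether method_B is CPU-bound or I/O-bound):

    # Only the executor class changes; the rest of the loop is identical
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for set_B in executor.map(method_B, set_A):
            results = results.union(set_B)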

Solving this with threads would look something like this:

from threading import Thread

# Method A that takes in a set of strings and returns the merged set of all sets B*
def method_A(set_A):
    # Initialize empty set to store results
    results = set()
    threads = []
    
    # Start new thread for each string in set A
    for string in set_A:
        t = Thread(target=method_B, args=(string, results))
        t.start()
        threads.append(t)

    # Wait for all threads to finish
    for t in threads:
        t.join()
    
    # Return the final results set
    return results

# Method B that takes in a string and a shared results set
def method_B(string, results):
    # Perform some operations on the string to generate a set of strings
    set_B = {string.lower(), string.upper()}  # placeholder for the real string operations
    
    # Merge the generated set into the shared results set in place
    # (results = results.union(set_B) would only rebind the local name
    # and leave the caller's set empty)
    results.update(set_B)

Note that a Thread does not give you access to the target function's return value, so instead you can pass your results set to the thread and update it in place there. Hope this helps!
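
One caveat worth adding (my assumption, not part of the answer above): if several threads update the shared set at once, it is safer to serialize the updates with a threading.Lock. CPython's GIL happens to make a single set.update() call effectively atomic, but that is an implementation detail rather than a guarantee. A minimal sketch:

from threading import Thread, Lock

def method_B(string, results, lock):
    # Placeholder: derive a set of strings from the input
    set_B = {string.lower(), string.upper()}
    # Hold the lock while touching the shared set
    with lock:
        results.update(set_B)

def method_A(set_A):
    results = set()
    lock = Lock()
    threads = [Thread(target=method_B, args=(s, results, lock)) for s in set_A]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results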

  • Do the threads run in different processes, side-stepping the GIL? https://docs.python.org/3/library/multiprocessing.html The question is about improving the speed of execution, so I just wondered... https://stackoverflow.com/questions/3044580/multiprocessing-vs-threading-python – maciek Dec 12 '22 at 09:23
  • Multithreading does not bypass the GIL, so it might not speed up your code, depending on where the bottleneck is. You could use multiprocessing instead, but I don't think you could pass the set that easily – Daniel Robinson Dec 12 '22 at 09:43
  • So the answer should be "Multithreading will not speed up the code"? – maciek Dec 12 '22 at 09:47