I developed a simple program to solve the eight queens problem. Now I would like to do some more testing with different meta-parameters, so I would like to make it fast. I went through a few iterations of profiling and was able to cut the runtime significantly, but I have reached the point where I believe only performing parts of the computation concurrently could make it faster. I tried to use the multiprocessing and concurrent.futures modules, but neither improved the runtime a lot, and in some cases they even slowed down execution. That is just to give some context.
I was able to come up with a similar code structure where the sequential version beats the concurrent one.
import numpy as np
import concurrent.futures
import math
import time
import multiprocessing


def is_prime(n):
    if n % 2 == 0:
        return False

    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True


def generate_data(seed):
    np.random.seed(seed)
    numbers = []
    for _ in range(5000):
        nbr = np.random.randint(50000, 100000)
        numbers.append(nbr)
    return numbers


def run_test_concurrent(numbers):
    print("Concurrent test")
    start_tm = time.time()
    chunk = len(numbers) // 3
    primes = None
    with concurrent.futures.ProcessPoolExecutor(max_workers=3) as pool:
        primes = list(pool.map(is_prime, numbers, chunksize=chunk))
    print("Time: {:.6f}".format(time.time() - start_tm))
    print("Number of primes: {}\n".format(np.sum(primes)))


def run_test_sequential(numbers):
    print("Sequential test")
    start_tm = time.time()
    primes = [is_prime(nbr) for nbr in numbers]
    print("Time: {:.6f}".format(time.time() - start_tm))
    print("Number of primes: {}\n".format(np.sum(primes)))


def run_test_multiprocessing(numbers):
    print("Multiprocessing test")
    start_tm = time.time()
    chunk = len(numbers) // 3
    primes = None
    with multiprocessing.Pool(processes=3) as pool:
        primes = list(pool.map(is_prime, numbers, chunksize=chunk))
    print("Time: {:.6f}".format(time.time() - start_tm))
    print("Number of primes: {}\n".format(np.sum(primes)))


def main():
    nbr_trails = 5
    for trail in range(nbr_trails):
        numbers = generate_data(trail * 10)
        run_test_concurrent(numbers)
        run_test_sequential(numbers)
        run_test_multiprocessing(numbers)
        print("--\n")


if __name__ == '__main__':
    main()
When I run it on my machine (Windows 7, Intel Core i5 with four cores), I get the following output:
Concurrent test
Time: 2.006006
Number of primes: 431
Sequential test
Time: 0.010000
Number of primes: 431
Multiprocessing test
Time: 1.412003
Number of primes: 431
--
Concurrent test
Time: 1.302003
Number of primes: 447
Sequential test
Time: 0.010000
Number of primes: 447
Multiprocessing test
Time: 1.252003
Number of primes: 447
--
Concurrent test
Time: 1.280002
Number of primes: 446
Sequential test
Time: 0.010000
Number of primes: 446
Multiprocessing test
Time: 1.250002
Number of primes: 446
--
Concurrent test
Time: 1.260002
Number of primes: 446
Sequential test
Time: 0.010000
Number of primes: 446
Multiprocessing test
Time: 1.250002
Number of primes: 446
--
Concurrent test
Time: 1.282003
Number of primes: 473
Sequential test
Time: 0.010000
Number of primes: 473
Multiprocessing test
Time: 1.260002
Number of primes: 473
--
The question I have is whether I can somehow make this faster by running it concurrently on Windows with Python 3.6.4 |Anaconda, Inc.|. I read here on SO (Why is creating a new process more expensive on Windows than Linux?) that creating new processes on Windows is expensive. Is there anything that can be done to speed things up? Am I missing something obvious?
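To get a feel for how much of the measured time is just process start-up rather than the prime checking itself, something like the sketch below could isolate the pool creation cost (noop is only a trivial placeholder task, not part of my actual code):

import multiprocessing
import time

def noop(x):
    # trivial task; the point is only to force the worker processes to start
    return x

if __name__ == '__main__':
    start_tm = time.time()
    with multiprocessing.Pool(processes=3) as pool:
        pool.map(noop, range(3))
    print("Pool start-up + trivial work: {:.6f}".format(time.time() - start_tm))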
I also tried to create the Pool only once, but it did not seem to help a lot.
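Roughly, the "create the Pool only once" attempt looked like the sketch below (it reuses is_prime and generate_data from the snippet above; run_trials_with_shared_pool is just an illustrative name, not my exact code):

import multiprocessing
import time

def run_trials_with_shared_pool(nbr_trails=5):
    # pay the process start-up cost once and reuse the same workers for every trial
    with multiprocessing.Pool(processes=3) as pool:
        for trail in range(nbr_trails):
            numbers = generate_data(trail * 10)
            chunk = len(numbers) // 3
            start_tm = time.time()
            primes = pool.map(is_prime, numbers, chunksize=chunk)
            print("Time: {:.6f}".format(time.time() - start_tm))
            print("Number of primes: {}\n".format(sum(primes)))

if __name__ == '__main__':
    run_trials_with_shared_pool()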
Edit:
The original code is structured more or less like this:
class Foo(object):

    def g(self) -> int:
        # function performing simple calculations
        # single function call is fast (~500 ms)
        pass

    def run(self):
        nbr_processes = multiprocessing.cpu_count() - 1
        with multiprocessing.Pool(processes=nbr_processes) as pool:
            foos = get_initial_foos()
            solution_found = False
            while not solution_found:
                # one iteration
                chunk = len(foos) // nbr_processes
                vals = list(pool.map(Foo.g, foos, chunksize=chunk))
                foos = modify_foos()
with foos having 1000 elements. It is not possible to tell in advance how quickly the algorithm converges and how many iterations will be executed, possibly thousands.
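To see where the break-even point between one sequential iteration and one pool.map iteration lies, I can time both over the same 1000 objects. Below is a sketch; FakeFoo and its g method are stand-ins for my real class, and the workload inside g is made up, so the absolute numbers would not mean much:

import multiprocessing
import time

class FakeFoo(object):
    def g(self) -> int:
        # placeholder for the real per-object calculation
        return sum(i * i for i in range(1000))

if __name__ == '__main__':
    foos = [FakeFoo() for _ in range(1000)]
    nbr_processes = max(1, multiprocessing.cpu_count() - 1)

    # sequential version of one iteration
    start_tm = time.time()
    vals = [foo.g() for foo in foos]
    print("Sequential iteration: {:.6f}".format(time.time() - start_tm))

    # parallel version of one iteration, pool created outside the timed part
    with multiprocessing.Pool(processes=nbr_processes) as pool:
        chunk = len(foos) // nbr_processes
        start_tm = time.time()
        vals = list(pool.map(FakeFoo.g, foos, chunksize=chunk))
        print("Parallel iteration: {:.6f}".format(time.time() - start_tm))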