
I want to do some large matrix multiplications using `multiprocessing.Pool`.

However, when the dimension is higher than 50, the computation suddenly takes an extremely long time.

Is there any easy way to make it faster?

Here, I don't want to use shared memory like `RawArray`, because my original code randomly generates the matrices each time.

The sample code is as follows.

import numpy as np
from time import time
from multiprocessing import Pool
from functools import partial

def f(d):
    # a grows with d while the repetition count N shrinks with it, so the
    # total work per call (N*a*10*10 multiply-adds) is the same for every d.
    a = int(10*d)
    N = int(10000/d)
    for _ in range(N):
        X = np.random.randn(a,10) @ np.random.randn(10,10)
    return X

# Dimensions
ds = [1,2,3,4,5,6,8,10,20,35,40,45,50,60,62,64,66,68,70,80,90,100]

# Serial processing
serial = []
for d in ds:
    t1 = time()
    for i in range(20):
        f(d)
    serial.append(time()-t1)

# Parallel processing
parallel = []
for d in ds:
    t1 = time()
    pool = Pool()
    for i in range(20):
        pool.apply_async(partial(f,d), args=())
    pool.close()
    pool.join()
    parallel.append(time()-t1)

# Plot
import matplotlib.pyplot as plt
plt.title('Matrix multiplication time with 10000/d repetitions')
plt.plot(ds,serial,label='serial')
plt.plot(ds,parallel,label='parallel')
plt.xlabel('d (dimension)') 
plt.ylabel('Total time (sec)')
plt.legend()
plt.show()

Since the total computation cost of `f(d)` is the same for all `d`, the parallel processing time should be roughly constant across dimensions.
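
As a quick sanity check of that claim, here is a minimal sketch (using the same constants as `f`) showing that the multiply-add count per call does not depend on d:

# Each call to f(d) performs N products of an (a,10) matrix with a (10,10)
# matrix, i.e. N*a*10*10 multiply-adds. With a = 10*d and N = 10000/d this
# is always 1e7, whatever d is.
for d in [1, 10, 50, 100]:
    a, N = 10*d, 10000//d
    print(d, N*a*10*10)   # prints 10000000 for every d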

But the actual output is not.

[Plot: serial vs. parallel total time against d; the parallel curve jumps sharply once d exceeds 50]

System info:

Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz

NOTE: I want to use parallel computation for a complicated internal simulation (like the `@` matrix multiplication above), not for sending data to the child processes.

Seung Hyeon Yu
  • I wonder how it works at all! If you add `print("foo")` as the very first line (before `import numpy as np`), how many times is it printed? – aparpara Apr 19 '19 at 10:31
  • Only once: `foo` – Seung Hyeon Yu Apr 19 '19 at 11:01
  • This could be system- or hardware-related; maybe system memory paging? I have a different [output](https://imgur.com/1IO3aTj) using python3.5, debian – Diane M Apr 19 '19 at 11:18
  • Possible duplicate of [Multiprocessing.Pool makes Numpy matrix multiplication slower](https://stackoverflow.com/questions/15414027/multiprocessing-pool-makes-numpy-matrix-multiplication-slower) – jwalton Apr 19 '19 at 11:28
  • Comparing the OP's plot with Arthur's, it's clear that one needs to be careful what to look for here, and in what context. If the use-case is fast matrix multiplication, don't use any Python-based parallelization; use a numpy setup with a multithreaded BLAS backend, which (probably) will be impossible to beat (much more lightweight parallelism). Analyzing the jump of the parallel approach might be an interesting task, but will never lead to better real-world performance (compared to BLAS). *Some* peaks (also in serial mode) might be due to different BLAS-related code paths / loop-unrolling, caches and co. – sascha Apr 19 '19 at 11:47
  • What's curious is the `@` operator in the `f` function for `X = np.random.randn(a,10) @ np.random.randn(10,10)`. I have no idea what that does. – lucasgcb Apr 19 '19 at 12:05
  • @Arthur. It's interesting. I should definitely check my hardware. Thank you. – Seung Hyeon Yu Apr 19 '19 at 12:06
  • @lucasgcb. `@` is a matrix multiplication which is equivalent to `np.dot`. – Seung Hyeon Yu Apr 19 '19 at 12:10
  • @SeungHyeonYu I got it, you are under Linux. Under Windows each process imports the main module, so you have to protect it with `if __name__ == '__main__':` – aparpara Apr 19 '19 at 12:15
  • And what does `np.show_config()` print? I suspect it is an Anaconda-related issue. It has some twists. – aparpara Apr 19 '19 at 12:19
  • @ArthurHavlicek, I have output similar to yours. Python 3.7.3, Windows 10 x64 – aparpara Apr 19 '19 at 12:23
  • @Ralph, [your issue](https://stackoverflow.com/questions/15414027/multiprocessing-pool-makes-numpy-matrix-multiplication-slower) has a problem with sending data from the parent to the child process, whereas my problem is related to the internal computation itself. – Seung Hyeon Yu Apr 19 '19 at 12:25
  • Could you plot an average of several attempts to exclude random factors? – ivan_pozdeev Apr 19 '19 at 13:46

2 Answers


This is for self-reference.

I found a solution.

My numpy uses MKL as its backend; the problem may be that MKL's own multithreading collides with multiprocessing.
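
To confirm that numpy is built against MKL (as aparpara suggested in the comments), you can print the build configuration; this uses only numpy's own `np.show_config`:

import numpy as np
np.show_config()   # look for blas_mkl_info / mkl_info sections in the output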

If I run the code:

import os
# Limit MKL to a single thread per process.
os.environ['MKL_NUM_THREADS'] = '1'

before importing numpy, the problem is solved.
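
For completeness, a minimal sketch of the fix applied to the top of the sample script (the `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` lines are assumptions added to cover non-MKL BLAS backends; they should be harmless no-ops under MKL):

import os
# Pin the BLAS backend to one thread per process, *before* numpy is
# imported, so the Pool workers don't oversubscribe the CPU cores.
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'   # assumption: covers OpenBLAS builds
os.environ['OMP_NUM_THREADS'] = '1'        # assumption: covers OpenMP-based backends

import numpy as np
from multiprocessing import Pool
# ... rest of the sample code is unchanged ...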

[Plot: with MKL_NUM_THREADS=1, the parallel timings no longer spike at high d]

Seung Hyeon Yu

I just found an explanation here: https://github.com/numpy/numpy/issues/10145. Looks like the CPU caching gets messed up when you have conflicting MKL matrix multiplications going at the same time.
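
If setting the environment variable before the numpy import is impractical (e.g. numpy is already imported elsewhere), an alternative sketch (assuming the third-party `threadpoolctl` package is installed; it is not part of numpy) limits the BLAS threads inside each worker through a `Pool` initializer:

from multiprocessing import Pool
from threadpoolctl import threadpool_limits   # pip install threadpoolctl

def worker_init():
    # Restrict the BLAS thread pool (MKL, OpenBLAS, ...) to a single thread
    # in this worker; keep a module-level reference so the limit persists.
    global _blas_limit
    _blas_limit = threadpool_limits(limits=1, user_api='blas')

pool = Pool(initializer=worker_init)

This keeps each of the pool's processes to one BLAS thread, avoiding the contention between concurrent MKL multiplications described above.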

Shane P Kelly