I'm writing code that has to solve a large number of eigenvalue problems (the typical matrix dimension is a few hundred). I was wondering whether it is possible to speed the process up using the IPython.parallel
module. As a former MATLAB user and Python newbie I was looking for something similar to MATLAB's parfor
...
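(For context, the closest parfor-like pattern I found outside IPython.parallel is the standard library's multiprocessing.Pool; a minimal sketch, shown only for comparison:)

import numpy as np
from multiprocessing import Pool
from scipy.linalg import eigvals   # eigvals(a) is equivalent to eig(a, right=False)

def solve(a):
    # must be a module-level function so it can be pickled to the workers
    return eigvals(a)

if __name__ == '__main__':
    mats = [np.random.rand(300, 300) for _ in range(100)]
    pool = Pool()                  # defaults to one worker per CPU
    eigenvalues = pool.map(solve, mats)
    pool.close()
    pool.join()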
Following some tutorials online I wrote a simple piece of code to check whether it speeds the computation up at all, and I found that it doesn't and often actually slows it down (it seems to be case dependent). I think I might be missing a point here: maybe scipy.linalg.eig
is implemented in such a way that it already uses all the available cores, and by trying to parallelise it I interfere with its own thread management.
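One way to test that hypothesis is to cap the BLAS thread count before NumPy/SciPy are imported. Assuming the build links against an OpenMP-based BLAS such as MKL or OpenBLAS, the OMP_NUM_THREADS environment variable should be honoured:

# run eig single-threaded to see how much the multithreaded BLAS contributes;
# the variable must be set before numpy/scipy are first imported
import os
os.environ['OMP_NUM_THREADS'] = '1'

import time
import numpy as np
from scipy.linalg import eig

a = np.random.rand(300, 300)
t0 = time.time()
eig(a, right=False)
print('single-threaded eig: %.3f s' % (time.time() - t0))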
Here is the 'parallel' code:
import numpy as np
from scipy.linalg import eig
from IPython import parallel

# create the matrices
matrix_size = 300
matrices = {}
for i in range(100):
    matrices[i] = np.random.rand(matrix_size, matrix_size)

rc = parallel.Client()
lview = rc.load_balanced_view()
results = {}

# compute the eigenvalues
for i in range(len(matrices)):
    asyncresult = lview.apply(eig, matrices[i], right=False)
    results[i] = asyncresult

# collect the results; get() blocks until each task has finished
for i, asyncresult in results.iteritems():
    results[i] = asyncresult.get()
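An alternative pattern I came across is the view's map interface, which does the submit-and-collect loop itself. A sketch using scipy.linalg.eigvals, which computes eigenvalues only and so matches eig(a, right=False):

# same computation through the load-balanced view's map interface
from scipy.linalg import eigvals

amr = lview.map(eigvals, [matrices[i] for i in range(len(matrices))])
results = amr.get()   # blocks until all tasks have completed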
The non-parallelised variant:
# no parallel
for i in range(len(matrices)):
    results[i] = eig(matrices[i], right=False)
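To compare the two fairly, both variants need to be timed end to end; a sketch along these lines (note that the parallel timing has to include the .get() calls, otherwise only the task submission is measured):

import time

# parallel: time submission *and* collection of the results
t0 = time.time()
asyncresults = [lview.apply(eig, matrices[i], right=False) for i in range(len(matrices))]
par_results = [ar.get() for ar in asyncresults]   # get() blocks until done
print('parallel: %.3f s' % (time.time() - t0))

# serial reference on the same matrices
t0 = time.time()
ser_results = [eig(matrices[i], right=False) for i in range(len(matrices))]
print('serial:   %.3f s' % (time.time() - t0))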
The difference in CPU time between the two variants is very subtle. And if, on top of the eigenvalue problem, the parallelised function has to do some additional matrix operations, it starts to take forever, i.e. at least 5 times longer than the non-parallelised variant.
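One suspicion I have is per-task overhead: each apply call pickles a 300-by-300 float64 matrix (about 0.7 MB) over to an engine and ships the eigenvalues back, which could easily swamp the few tens of milliseconds the decomposition itself takes. A chunked variant that would amortise this by sending several matrices per task (a sketch, not benchmarked):

# send the matrices to the engines in chunks to amortise per-task overhead
def eig_batch(mats):
    # runs on the engine, so the import has to happen there
    from scipy.linalg import eig
    return [eig(m, right=False) for m in mats]

chunk = 10
batches = [[matrices[j] for j in range(i, i + chunk)]
           for i in range(0, len(matrices), chunk)]
asyncresults = [lview.apply(eig_batch, b) for b in batches]
results = [w for ar in asyncresults for w in ar.get()]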
Am I right that eigenvalue problems are not really suited for this kind of parallelisation, or am I missing the whole point?
Many thanks!
EDITED 29 Jul 2013; 12:20 BST
Following moarningsun's suggestion I tried to run eig
while fixing the number of threads with mkl.set_num_threads
. For a 500-by-500 matrix, the minimum times over a set of 50 repetitions are the following (a sketch of the measurement loop follows the table):
No. of threads   minimum time (timeit, s)   CPU usage (Task Manager)
=====================================================================
 1               0.4513775764796151         12-13%
 2               0.36869288559927327        25-27%
 3               0.34014644287680085        38-41%
 4               0.3380558903450037         49-53%
 5               0.33508234276183657        49-53%
 6               0.3379019065051807         49-53%
 7               0.33858615048501406        49-53%
 8               0.34488405094054997        49-53%
 9               0.33380300334101776        49-53%
10               0.3288481198342197         49-53%
11               0.3512653110685733         49-53%
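A sketch of the measurement loop, assuming the mkl module that exposes set_num_threads (as in moarningsun's suggestion) is importable:

import timeit
import numpy as np
import mkl
from scipy.linalg import eig

a = np.random.rand(500, 500)
for n in range(1, 12):
    mkl.set_num_threads(n)
    # best of 50 single runs, as in the table above
    t = min(timeit.repeat(lambda: eig(a, right=False), number=1, repeat=50))
    print('%2d threads: %.6f s' % (n, t))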
Apart from the one-thread case there is no substantial difference (maybe 50 samples is a bit too small...). I still think I'm missing the point and that a lot could be done to improve the performance, but I'm not really sure how. These were run on a machine with 4 physical cores and hyperthreading enabled, giving 8 virtual cores (which matches the ~50% CPU ceiling above).
Thanks for any input!