Parallel linear algebra for multicore system

Question

I'm developing a program that needs to do heavy linear algebra calculations.

Now I'm using LAPACK/BLAS routines, but I need to exploit my machine (24 core Xeon X5690).

I've found projects like pblas and scalapack, but they all seem to focus on distributed computing and on using MPI.

I have no cluster available, all computations will be done on a single server and using MPI looks like an overkill.

Does anyone have any suggestion on this?

Don't write off using MPI, especially not if you can find MPI-enabled libraries to do your heavy lifting. That may be easier than finding libraries for shared-memory linear algebra or writing your own shared-memory code. A good MPI implementation will run efficiently on a shared-memory computer. — High Performance Mark, Apr 05 '12 at 09:28
I know. I didn't mean I would have written MPI code by myself. I simply wanted to avoid everything that uses MPI since this will not be a cluster program and I want to avoid any unnecessary overhead — Patrik, Apr 05 '12 at 12:24

Jonathan Dursi · Answer 1 · 2012-04-05T13:02:52.527

10

As mentioned by @larsmans (with, say, MKL), you still use LAPACK + BLAS interfaces, but you just find a tuned, multithreaded version for your platform. MKL is great, but expensive. Other, open-source, options include:

OpenBLAS / GotoBLAS, the Nehalem support should work ok but no tuned support yet for westmere. Does multithreading very well.
Atlas : automatically tunes to your architecture at installation time. probably slower for "typical" matricies (eg, square SGEMM) but can be faster for odd cases, and for westmere may even beat out OpenBLAS/GotoBLAS, haven't tested this myself. Mostly optimized for serial case, but does include parallel multithreading routines.
Plasma - LAPACK implementation designed specificially for multicore.

I'd also agree with Mark's comment; depending on what LAPACK routines you're using, the distributed memory stuff with MPI might actually be faster than the multithreaded. That's unlikely to be the case with BLAS routines, but for something more complicated (say the eigenvalue/vector routines in LAPACK) it's worth testing. While it's true that MPI function calls are an overhead, doing things in a distributed-memory mode means you don't have to worry so much about false sharing, synchronizing access to shared variables, etc.

edited Apr 05 '12 at 13:02

answered Apr 05 '12 at 12:54

Jonathan Dursi

50,107
9
127
158

I'm using dlamch, dstebz, dstein, dlamch and dstebz – Patrik Apr 07 '12 at 12:16
Maybe pdstebz and pdstein are the function I'm looking for, but they have a different interface. What are those lwork, liwork and orthogonalization parameters? What is the best way to exploit multiple cores in calling dstebz and dstein? – Patrik Apr 07 '12 at 13:00
I think the first thing is to just try with a multithreaded blas/lapack installation - say openblas or gotoblas. Getting started with scalapack takes some doing, because the distribution of the matrix between cores is non-obvious. The good news is that once you go to the trouble of using scalapack, it will work even across nodes. I gave an example of using scalapack here: http://scicomp.stackexchange.com/questions/1688/how-do-i-use-scalapack-pblas-for-matrix-vector-multiplication/1713#1713 – Jonathan Dursi Apr 07 '12 at 15:02
So I only need to build openblas, link it and I get parallel eigenvalues/eigenvectors computation? – Patrik Apr 09 '12 at 17:20
If you use one of the multicore BLAS or LAPACK packages, you'll get multithreading in whatever routines they support multithreading for. If you just take a stock lapack and use a threaded blas, the lapack routine will call threaded blas routines, which will give you some speed up, depending on how much of the work is done in the lapack, and how much in the blas calls. If that's enough improvement for you, you're done; otherwise, you may have to move to an explicitly parallel lapack, like scalapack. – Jonathan Dursi Apr 09 '12 at 17:46
I've tried to profile 3 runs, one with lapack-atlas, one with mkl and one with openblas. It turned out that the execution time was the same (more or less) for all the tests. So, either all of them are already using more cores or the eigenvalue problem solved by dstebz isn't much parallel. Mah – Patrik Apr 11 '12 at 08:23

Fred Foo · Answer 2 · 2012-11-22T00:20:19.220

2

Consider using Intel MKL. OpenBLAS can also be quite fast, though I haven't run it on > quadcore machines yet.

edited Nov 22 '12 at 00:20

answered Apr 05 '12 at 09:19

Fred Foo

355,277
75
744
836

Good call, but how to use it? Documentation show that functions have the same name as in LAPACK. Do I just need to link against mkl instead of lapack blas and gfortran? (my current ldflags are -llapack -lblas -lgfortran). Basically I need a parallel replacement for DSTEBZ and DSTEIN – Patrik Apr 05 '12 at 12:27
@Patrik: MKL is designed to be BLAS-compatible, so just recompiling and linking *should* work. I must admit I never use it directly from C, but always through Numpy. – Fred Foo Apr 05 '12 at 12:39
2

From Fortran it is, as @larsmans guesses, a straightforward operation of linking against the (right) libraries. If you are having trouble, check out the Intel link-line advisor: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/ – High Performance Mark Apr 05 '12 at 12:41
1

If you're using C, the lapacke interfaces (originally in MKL, now standard in lapack) are a very nice way to use lapack. – Jonathan Dursi Apr 05 '12 at 12:57

Parallel linear algebra for multicore system

2 Answers2

Linked