
I really want to know how to utilize multi-core processing for matrix multiplication on numpy/pandas.

What I'm trying is here:

```python
M = pd.DataFrame(...)  # very high-dimensional square matrix
A = M.T.dot(M)
```

This takes a huge amount of processing time because of the many sums of products, and I think it should be straightforward to use multithreading for a huge matrix multiplication. I've been googling carefully, but I can't find how to do that with numpy/pandas. Do I need to write multithreaded code manually with some Python built-in threading library?
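For reference, here is the same computation on a plain NumPy array (sizes here are small and purely illustrative). When the array has a floating-point dtype, `X.T.dot(X)` is dispatched as a single BLAS `gemm` call, which is where multithreading can happen:

```python
import numpy as np

# Illustrative sizes; the real matrix in the question is much larger.
rng = np.random.RandomState(0)
X = rng.rand(1000, 200)   # float64 by default

# A single BLAS call; a multithreaded BLAS spreads it across cores.
G = X.T.dot(X)
print(G.shape)  # (200, 200)
```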

Saullo G. P. Castro
Light Yagmi
  • Don't do this in pandas (if it's matrix manipulation); just stay in numpy. Does this function `lambda x: x.T.dot(x)` have another name? (There may already be a numpy function for it which you can call with numba or something)... – Andy Hayden Apr 04 '14 at 01:34
  • On my fedora 20 python/numpy install, I see multiple cores used on a large `x.T.dot(x)` calc. My percent of CPU for the whole script, including creating the matrix, was 282%. Is multi-core support in this situation a function of which libraries numpy links to? – Karl D. Apr 04 '14 at 02:23
  • You have to have your numpy compiled with Intel's MKL library. You can check with `import numpy as np; np.show_config()` – Martin Apr 04 '14 at 07:18
  • Thank you guys. I'm sorry I failed to mention my environment: Mac OS X. I tried the MKL library (Anaconda bundle), but it seems not to use multiple cores (CPU stays at just 100%). I guess only Linux users can enjoy the multi-core feature of MKL, because [this benchmark](http://stackoverflow.com/questions/5260068/multithreaded-blas-in-python-numpy) treats OS X as single-core. – Light Yagmi Apr 06 '14 at 22:47
  • @Martin There are many more multithreaded BLAS implementations than just MKL - I would highly recommend [OpenBLAS](http://www.openblas.net/), which is open source and achieves comparable performance to the proprietary MKL library. [ATLAS](http://math-atlas.sourceforge.net/) is another option, although my experience has been that it's slower and way more of a pain to compile. – ali_m May 10 '14 at 19:44
  • @ali_m: True. I don't have any experience with other multithreaded implementations of BLAS because I use MKL also in other ways (via numexpr where I believe is no other alternative). – Martin May 12 '14 at 11:35
  • @Martin Well MKL isn't strictly required for `numexpr` either, although you're right in the sense that a certain subset of optimizations are only possible with Intel's VML. Anyway, the point is that there's no fundamental reason why the OP couldn't use one of a number of different multithreaded BLAS libraries to accelerate dot products. – ali_m May 12 '14 at 13:28
  • @ali_m: I was just explaining why I'm stuck in the MKL-only thinking. And in fact, I didn't know there are other BLAS implementations which can compete with MKL. Thanks for pointing that out. – Martin May 12 '14 at 14:57
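The check Martin suggests in the comments is a one-liner; the output lists the BLAS/LAPACK libraries the NumPy build links against (MKL, OpenBLAS, ATLAS, or a plain reference BLAS), and the exact text varies per install:

```python
import numpy as np

# Prints which BLAS/LAPACK implementation this NumPy build links against.
np.show_config()
```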

2 Answers

In NumPy, multithreaded matrix multiplication can be achieved with a multithreaded implementation of BLAS, the Basic Linear Algebra Subprograms. You need to:

  1. Have such a BLAS implementation; OpenBLAS, ATLAS and MKL all include multithreaded matrix multiplication.
  2. Have a NumPy compiled to use such an implementation.
  3. Make sure the matrices you're multiplying both have a dtype of float32 or float64 (and meet certain alignment restrictions; I recommend using NumPy 1.7.1 or later where these have been relaxed).
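Point 3 can be illustrated with a small sketch (array contents here are arbitrary): a `float64` array takes the BLAS fast path, while, for example, an object-dtype array falls back to a slow generic loop:

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)
print(a.dtype)   # float64 -> eligible for the BLAS fast path

# An object-dtype array produces the same values, but via a slow
# element-by-element loop rather than a BLAS call.
b = a.astype(object)
print(b.dtype)   # object
```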

A few caveats apply:

  • Older versions of OpenBLAS, when compiled with GCC, run into trouble in programs that use multiprocessing, which includes most applications that use joblib. In particular, they will hang. The cause is a bug (or lack of a feature) in GCC. A patch has been submitted but is not yet included in the mainline sources.
  • The ATLAS packages you find in a typical Linux distro may or may not be compiled to use multithreading.

As for Pandas: I'm not sure how it does dot products. Convert to NumPy arrays and back to be sure.
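A round trip of the kind suggested above might look like this (sizes are illustrative): drop to a float64 ndarray so the product goes through BLAS, then wrap the result back into a DataFrame:

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the question's M is much larger.
M = pd.DataFrame(np.random.rand(100, 10))

# Convert to an ndarray for the BLAS-backed product, then rebuild
# a DataFrame with the original column labels on both axes.
X = M.values
A = pd.DataFrame(X.T.dot(X), index=M.columns, columns=M.columns)
print(A.shape)  # (10, 10)
```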

Fred Foo

First of all, I would also propose converting to numpy arrays and using numpy's `dot` function. If you have access to an MKL build, which is more or less the fastest implementation at the moment, you should try setting the environment variable `OMP_NUM_THREADS`. This should activate the other cores of your system. On my Mac it seems to work properly. In addition, I would try `np.einsum`, which seems to be faster than `np.dot`.
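A sketch of both suggestions (the thread count of 4 is a hypothetical value; set the variable before NumPy is first imported so its BLAS picks it up). The `einsum` spelling produces the same result as `x.T.dot(x)`; whether it is actually faster depends on the build:

```python
import os
# Must be set before the first `import numpy` in the process.
os.environ["OMP_NUM_THREADS"] = "4"   # hypothetical core count

import numpy as np

x = np.random.rand(500, 50)

# einsum spelling of x.T.dot(x): result[j, k] = sum_i x[i, j] * x[i, k]
a_dot = x.T.dot(x)
a_ein = np.einsum('ij,ik->jk', x, x)
print(np.allclose(a_dot, a_ein))  # True
```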

But pay attention! If you have compiled a multithreaded library that uses OpenMP for parallelisation (like MKL), you have to consider that the default "gcc" on all Apple systems is not GCC; it is Clang/LLVM, and Clang is not able to build with OpenMP support at the moment, unless you use the OpenMP trunk, which is still experimental. So you have to install the Intel compiler or any other compiler that supports OpenMP.

lemitech