17

I have been playing around with numba and numexpr, trying to speed up a simple element-wise matrix multiplication. I have not been able to get better results; both are basically equivalent (speed-wise) to numpy's multiply function. Has anyone had any luck in this area? Am I using numba and numexpr wrong (I'm quite new to this), or is this altogether a bad approach to speeding this up? Here is a reproducible example, thank you in advance:

import numpy as np
from numba import autojit
import numexpr as ne

a=np.random.rand(10,5000000)

# numpy
multiplication1 = np.multiply(a,a)

# numba
def multiplix(X,Y):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, N), dtype=np.float64)  # np.float is removed in modern NumPy
    for i in range(M):
        for j in range(N):
            D[i,j] = X[i, j] * Y[i, j]
    return D

mul = autojit(multiplix)
multiplication2 = mul(a,a)

# numexpr
def numexprmult(X,Y):
    return ne.evaluate("X * Y")

multiplication3 = numexprmult(a,a) 
ali_m
JEquihua
  • `numexpr` can outshine `numpy` for ufunc-like operations like this, especially when stringing several together. Also, if you have more than one core, try setting `ne.set_num_cores(N)` where `N` is the number of cores your machine has. – askewchan Oct 09 '13 at 07:28
  • On my machine your `numexpr`-based function is about 15% slower than `np.multiply()` running on a single core, but beats it by about a factor of two when I set the number of cores to 8. Bear in mind that you may have to reset the core affinity of your Python process in order to use more than one core - [see my answer here](http://stackoverflow.com/a/15641148/1461210). – ali_m Oct 14 '13 at 23:11
  • You could try employing your GPU using [Theano](https://github.com/Theano/Theano). I truly don't know whether it will help and the results will depend on your exact hardware but it might be worth a shot. [Here](https://groups.google.com/forum/#!topic/theano-users/fZpCchn4JbI) you will find an example of how to do elementwise matrix multiplication using Theano. – Eiríkur Fannar Torfason Oct 18 '13 at 13:30
  • If you can, update your numpy to 1.8 (as of writing, about to be released); that should give a simple speedup. Otherwise you will have to use something else that can employ SIMD instructions or is optimized for your processor. – seberg Oct 20 '13 at 09:59
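A minimal sketch of the multi-threaded numexpr approach suggested in the comments above (current numexpr versions expose `set_num_threads`; the thread count of 4 is a placeholder for your machine's core count, and the pure-numpy fallback is only there so the snippet runs even without numexpr installed):

```python
import numpy as np

try:
    import numexpr as ne
    ne.set_num_threads(4)  # placeholder: set to your actual core count

    def mult(x, y):
        # numexpr compiles the string expression and evaluates it
        # in chunks across the configured number of threads
        return ne.evaluate("x * y")
except ImportError:
    # fallback so the sketch still runs without numexpr installed
    def mult(x, y):
        return np.multiply(x, y)

a = np.random.rand(10, 1000)
assert np.allclose(mult(a, a), a * a)
```

The multi-threading only pays off for arrays large enough that the work dwarfs the per-call overhead, which is the regime the question's 10x5000000 array is in.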

5 Answers

15

What about using Fortran and ctypes?

elementwise.F90:

subroutine elementwise( a, b, c, M, N ) bind(c, name='elementwise')
  use iso_c_binding, only: c_float, c_int

  integer(c_int),intent(in) :: M, N
  real(c_float), intent(in) :: a(M, N), b(M, N)
  real(c_float), intent(out):: c(M, N)

  integer :: i,j

  forall (i=1:M,j=1:N)
    c(i,j) = a(i,j) * b(i,j)
  end forall

end subroutine 

elementwise.py:

from ctypes import CDLL, POINTER, c_int, c_float
import numpy as np
import time

fortran = CDLL('./elementwise.so')
fortran.elementwise.argtypes = [ POINTER(c_float), 
                                 POINTER(c_float), 
                                 POINTER(c_float),
                                 POINTER(c_int),
                                 POINTER(c_int) ]

# Setup    
M=10
N=5000000

a = np.empty((M,N), dtype=c_float)
b = np.empty((M,N), dtype=c_float)
c = np.empty((M,N), dtype=c_float)

a[:] = np.random.rand(M,N)
b[:] = np.random.rand(M,N)


# Fortran call
start = time.time()
fortran.elementwise( a.ctypes.data_as(POINTER(c_float)), 
                     b.ctypes.data_as(POINTER(c_float)), 
                     c.ctypes.data_as(POINTER(c_float)), 
                     c_int(M), c_int(N) )
stop = time.time()
print('Fortran took ', stop - start, 'seconds')

# Numpy
start = time.time()
c = np.multiply(a,b)
stop = time.time()
print('Numpy took ', stop - start, 'seconds')

I compiled the Fortran file using

gfortran -O3 -funroll-loops -ffast-math -floop-strip-mine -shared -fPIC \
         -o elementwise.so elementwise.F90

The output yields a speed-up of ~10%:

 $ python elementwise.py 
Fortran took  0.213667869568 seconds
Numpy took  0.230120897293 seconds
 $ python elementwise.py 
Fortran took  0.209784984589 seconds
Numpy took  0.231616973877 seconds
 $ python elementwise.py 
Fortran took  0.214708089828 seconds
Numpy took  0.25369310379 seconds
Alexander Vogt
  • Cute answer. The speed-up isn't really impressive, but I'm interested in playing around with this, thank you. – JEquihua Oct 20 '13 at 02:55
  • Cute answer, as JEquihua said. However, to get an accurate timing, one must do a first Fortran call to initialize the shared library; the second call is the one that gives the most accurate figure. The speedup should be around 50%. Another way to get an accurate figure is to use a loop (say, 100 calls of the same function) and take the average time. – innoSPG Dec 08 '13 at 18:27
  • Why would the speedup be around 50%? How? @innoSPG – JEquihua Dec 26 '13 at 17:35
  • @JEquihua, I forgot to mention that 50% is according to my own local test. Thank you for pointing that out. It may depend on your system configuration. – innoSPG Jan 09 '14 at 20:35
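innoSPG's warm-up-and-average point can be sketched with `timeit`. A pure-numpy stand-in is used for the timed call, since the compiled `elementwise.so` is not assumed to be present here; for the ctypes version, the warm-up call is what absorbs the one-time shared-library load:

```python
import timeit
import numpy as np

a = np.random.rand(10, 100000).astype(np.float32)
b = np.random.rand(10, 100000).astype(np.float32)

def numpy_mult():
    return np.multiply(a, b)

numpy_mult()  # warm-up call; for the ctypes version this also loads the .so
n = 100
avg = timeit.timeit(numpy_mult, number=n) / n  # average over many calls
print('average per call: %.6f s' % avg)
```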
6

How are you doing your timings ?

The creation of your random array takes up the bulk of your calculation time, and if you include it in your timing you will hardly see any real difference in the results. However, if you create it up front, you can actually compare the methods.

Here are my results, and I'm consistently seeing what you are seeing: numpy and numba give about the same results (with numba being a little bit faster).

(I don't have numexpr available)

In [1]: import numpy as np
In [2]: from numba import autojit
In [3]: a=np.random.rand(10,5000000)

In [4]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 90 ms per loop

In [5]: # numba

In [6]: def multiplix(X,Y):
   ...:         M = X.shape[0]
   ...:         N = X.shape[1]
   ...:         D = np.empty((M, N), dtype=np.float)
   ...:         for i in range(M):
   ...:                 for j in range(N):
   ...:                         D[i,j] = X[i, j] * Y[i, j]
   ...:         return D
   ...:         

In [7]: mul = autojit(multiplix)

In [26]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 182 ms per loop

In [27]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 185 ms per loop

In [28]: %timeit multiplication1 = np.multiply(a,a)
10 loops, best of 3: 181 ms per loop

In [29]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 179 ms per loop

In [30]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 180 ms per loop

In [31]: %timeit multiplication2 = mul(a,a)
10 loops, best of 3: 178 ms per loop

Update: I used the latest version of numba, just compiled it from source: '0.11.0-3-gea20d11-dirty'

I tested this with the default numpy in Fedora 19, '1.7.1' and numpy '1.6.1' compiled from source, linked against:

Update 3: My earlier results were of course incorrect; I had `return D` inside the inner loop, skipping 90% of the calculations.

This provides more evidence for ali_m's assumption that it is really hard to do better than the already very optimized C code.

However, if you are trying to do something more complicated, e.g.,

np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

I can reproduce the figures Jake Vanderplas gets:

In [14]: %timeit pairwise_numba(X)
10000 loops, best of 3: 92.6 us per loop

In [15]: %timeit pairwise_numpy(X)
1000 loops, best of 3: 662 us per loop

So it seems you are doing something that is so well optimized by numpy already that it is hard to do any better.
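For reference, the two pairwise-distance implementations being compared look roughly like this. The numba version is just the explicit-loop variant passed through `autojit`; it is sketched here without the decorator so it runs anywhere:

```python
import numpy as np

def pairwise_numpy(X):
    # broadcasting version quoted above: all pairwise differences at once
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

def pairwise_loops(X):
    # explicit-loop version; this is what numba's autojit would compile
    M, K = X.shape
    D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(K):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

X = np.random.rand(20, 3)
assert np.allclose(pairwise_numpy(X), pairwise_loops(X))
```

Unlike plain element-wise multiplication, the broadcasting version here allocates a large (M, M, K) temporary, which is exactly the kind of overhead a compiled loop can avoid.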

Jens Timmerman
  • I am doing my timing using `%%timeit a = np.random.rand(10,5000000) \ mul(a,a)` - the creation of the array is not included in the timed calculations. Which versions of numba and numpy are you using? – ali_m Oct 16 '13 at 16:13
  • @ali_m I answered in my post. – Jens Timmerman Oct 16 '13 at 16:44
  • Interesting... I'm beginning to suspect there may be something subtly broken about my current numba/pyllvm/llvm setup (for one thing I hit a compiler error for numba versions newer than v0.10.2). I'll dig into it - perhaps it may be relevant to what the OP is experiencing. – ali_m Oct 16 '13 at 17:42
  • I too excluded the array creation in the timing. Interesting. I have no idea why you are seeing such a huge improvement with numba. Could anyone help me get to the bottom of this? – JEquihua Oct 17 '13 at 03:35
  • @ali_m I just copy-pasted the original code in ipython, which had put the `return D` inside the i loop, thus skipping 90% of the calculation; this makes more sense now. – Jens Timmerman Oct 17 '13 at 12:41
  • @JensTimmerman - ah, now that does make a lot more sense, although I must admit I'm still slightly disappointed that I can't have a magic 10x speed boost to core numpy functions. – ali_m Oct 17 '13 at 16:50
4

Edit: never mind this answer, I'm wrong (see comment below).


I'm afraid it will be very, very hard to get faster matrix multiplication in Python than with numpy's. NumPy usually uses internal Fortran libraries like ATLAS/LAPACK that are very well optimized.

To check if your version of NumPy was built with LAPACK support: open a terminal, go to your Python install directory and type:

for f in `find lib/python2.7/site-packages/numpy/* -name \*.so`; do echo $f; ldd $f;echo "\n";done | grep lapack

Note that the path can vary depending on your Python version. If some lines get printed, you surely have LAPACK support... so having faster matrix multiplication on a single core will be very hard to achieve.

Now I don't know about using multiple cores to perform matrix multiplication, so you might want to look into that (see ali_m's comment).
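A more portable way to run the same check from within Python (rather than ldd, which assumes a Linux-style install layout) is numpy's own build-info dump:

```python
import numpy as np

# prints the BLAS/LAPACK build information numpy was compiled against,
# including library names and link paths
np.show_config()
```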

Nathan
  • External BLAS/LAPACK libraries are only relevant for linear algebra operations such as _matrix_ multiplication. _Elementwise_ multiplication, as in the OP's example, uses a [`ufunc`](http://docs.scipy.org/doc/numpy/reference/ufuncs.html) written in C code that is an intrinsic component of numpy. Having said that, my feeling is that it would be asking an awful lot for either of these approaches to beat the speed of hand-written C code for something as simple as elementwise multiplication. – ali_m Oct 16 '13 at 10:12
2

Use a GPU, with the following package:

gnumpy

sidquanto
2

The speed of np.multiply heavily relies on the arrays being exactly the same shape.

a = np.random.rand(80000,1)
b = np.random.rand(80000,1)

c = np.multiply(a, b)

is fast as hell, whereas the following code takes more than a minute and uses up all my 16 GB of RAM:

a = np.squeeze(np.random.rand(80000,1))
b = np.random.rand(80000,1)

c = np.multiply(a, b)

So my advice would be to use arrays of exactly the same dimensions. Hope this is useful for someone looking to speed up element-wise multiplication.
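A small sketch of what happens with mismatched shapes, scaled down so it doesn't eat memory (as Bob's comment below notes, the mismatch triggers broadcasting into an outer product):

```python
import numpy as np

a = np.squeeze(np.random.rand(5, 1))   # shape (5,)
b = np.random.rand(5, 1)               # shape (5, 1)

# broadcasting (5,) against (5, 1) produces a (5, 5) outer product,
# not an element-wise product -- hence the memory blow-up at size 80000
c = np.multiply(a, b)
assert c.shape == (5, 5)

# with matching shapes the result stays element-wise
d = np.multiply(b, b)
assert d.shape == (5, 1)
```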

bpshop
  • That's because the second code computes the outer product, whereas the first does element-wise multiplication. Two very different operations: the first produces an array of size (80000,), the second one of size (80000,80000). – Bob Feb 27 '21 at 01:45