17

Sorry for so many questions. I am running Mac OS X 10.6 on an Intel Core 2 Duo. I am running some benchmarks for my research and I have run into another thing that baffles me.

If I run

python -mtimeit -s 'import numpy as np; a = np.random.randn(1e3,1e3)' 'np.dot(a,a)'

I get the following output: 10 loops, best of 3: 142 msec per loop

However, if I run

python -mtimeit -s 'import numpy as np; a = np.random.randint(10,size=1e6).reshape(1e3,1e3)' 'np.dot(a,a)'

I get the following output: 10 loops, best of 3: 7.57 sec per loop

Then I ran

python -mtimeit -s 'import numpy as np; a = np.random.randn(1e3,1e3)' 'a*a'

and then

python -mtimeit -s 'import numpy as np; a = np.random.randint(10,size=1e6).reshape(1e3,1e3)' 'a*a'

Both ran at about 7.6 msec per loop, so it is not the elementwise multiplication. Addition had similar speeds as well, so neither of these should be affecting the dot product, right? So why is it over 50 times slower to compute the dot product using ints than using floats?
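
For what it's worth, here is a minimal sketch of the same comparison using Python's timeit module instead of the command line (the matrices are smaller here only so the integer case finishes quickly):

import timeit

setup_float = "import numpy as np; a = np.random.randn(500, 500)"
setup_int = "import numpy as np; a = np.random.randint(10, size=(500, 500))"

# Time 10 matrix products for each setup.
t_float = timeit.timeit("np.dot(a, a)", setup=setup_float, number=10)
t_int = timeit.timeit("np.dot(a, a)", setup=setup_int, number=10)

print("float64: %.3f s for 10 products" % t_float)
print("int:     %.3f s for 10 products" % t_int)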

Nino
  • Same for me on Linux--I get about 3 seconds for float64 and 10 seconds for int32 (this is an older machine). Not a factor of 50, but still very strange. – Luke Aug 08 '12 at 01:37

2 Answers

16

Very interesting. I was curious to see how it was implemented, so I did:

>>> import inspect
>>> import numpy as np
>>> inspect.getmodule(np.dot)
<module 'numpy.core._dotblas' from '/Library/Python/2.6/site-packages/numpy-1.6.1-py2.6-macosx-10.6-universal.egg/numpy/core/_dotblas.so'>
>>> 

So it looks like it's using the BLAS library.
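
(If _dotblas isn't present on your install, another quick check is to print NumPy's build configuration, which lists the BLAS/LAPACK libraries it was linked against; the exact output depends on how NumPy was built.)

>>> import numpy as np
>>> np.show_config()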

so:

>>> help(np.core._dotblas)

from which I found this:

When Numpy is built with an accelerated BLAS like ATLAS, these functions are replaced to make use of the faster implementations. The faster implementations only affect float32, float64, complex64, and complex128 arrays. Furthermore, the BLAS API only includes matrix-matrix, matrix-vector, and vector-vector products. Products of arrays with larger dimensionalities use the built in functions and are not accelerated.

So it looks like ATLAS fine-tunes certain functions, but it's only applicable to certain data types. Very interesting.
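
A rough way to see that restriction in practice is to time the same product across dtypes. This is just a sketch, and the exact numbers will depend on your BLAS build:

import timeit
import numpy as np

# Rough per-dtype timing of the same 300x300 product; per the docstring above,
# only float32/float64/complex64/complex128 go through the accelerated BLAS path.
for dtype in (np.float32, np.float64, np.complex64, np.complex128, np.int32):
    a = np.ones((300, 300), dtype=dtype)
    t = timeit.timeit(lambda: np.dot(a, a), number=10)
    print("%-10s %.4f s" % (dtype.__name__, t))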

So yeah, it looks like I'll be using floats more often ...

Samy Vilar
  • @Luke thanks, +1 for doing a backtrace; it's also another useful method. – Samy Vilar Aug 08 '12 at 02:06
  • Good to know. When I go to work I will use the same method to see if it is true for MKL as well. Thanks for the help. Luke likes this answer better so you get the accept. – Nino Aug 08 '12 at 21:12
  • 5
    I love it when answers not only answer the question, but also show *how* the answer was found, introducing techniques along the way. +1. Need more answers like this around here. – naught101 Jul 11 '16 at 02:16
7

Using int vs float data types causes different code paths to be executed:

The stack trace for float looks like this:

(gdb) backtr
#0  0x007865a0 in dgemm_ () from /usr/lib/libblas.so.3gf
#1  0x007559d5 in cblas_dgemm () from /usr/lib/libblas.so.3gf
#2  0x00744108 in dotblas_matrixproduct (__NPY_UNUSED_TAGGEDdummy=0x0, args=(<numpy.ndarray at remote 0x85d9090>, <numpy.ndarray at remote 0x85d9090>), 
kwargs=0x0) at numpy/core/blasdot/_dotblas.c:798
#3  0x08088ba1 in PyEval_EvalFrameEx ()
...

..while the stack trace for int looks like this:

(gdb) backtr
#0  LONG_dot (ip1=0xb700a280 "\t", is1=4, ip2=0xb737dc64 "\a", is2=4000, op=0xb6496fc4 "", n=1000, __NPY_UNUSED_TAGGEDignore=0x85fa960)
at numpy/core/src/multiarray/arraytypes.c.src:3076
#1  0x00659d9d in PyArray_MatrixProduct2 (op1=<numpy.ndarray at remote 0x85dd628>, op2=<numpy.ndarray at remote 0x85dd628>, out=0x0)
at numpy/core/src/multiarray/multiarraymodule.c:847
#2  0x00742b93 in dotblas_matrixproduct (__NPY_UNUSED_TAGGEDdummy=0x0, args=(<numpy.ndarray at remote 0x85dd628>, <numpy.ndarray at remote 0x85dd628>), 
kwargs=0x0) at numpy/core/blasdot/_dotblas.c:254
#3  0x08088ba1 in PyEval_EvalFrameEx ()
...

Both calls lead to dotblas_matrixproduct, but it appears that the float call stays in the BLAS library (probably accessing some well-optimized code), while the int call gets kicked back out to numpy's PyArray_MatrixProduct2.
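
You can confirm from Python that the slow integer path still produces the same numbers, just far more slowly. A small sketch (values kept small so the integer result is exactly representable as float64):

import time
import numpy as np

a_int = np.random.randint(10, size=(500, 500))
a_flt = a_int.astype(np.float64)

t0 = time.time()
r_flt = np.dot(a_flt, a_flt)   # routed through BLAS dgemm
t_flt = time.time() - t0

t0 = time.time()
r_int = np.dot(a_int, a_int)   # falls back to the generic loop
t_int = time.time() - t0

print("float64: %.4f s   int: %.4f s" % (t_flt, t_int))
print("same result:", np.array_equal(r_int, r_flt.astype(a_int.dtype)))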

So this is either a bug or BLAS just doesn't support integer types in matrixproduct (which seems rather unlikely).

Here's an easy and inexpensive workaround:

af = a.astype(float)           # one-time copy of a as float64
np.dot(af, af).astype(int)     # BLAS product, cast back to int
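
If you need this in several places, a small helper along these lines keeps the cast in one spot (the name and the use of rint are just illustrative; the round trip is only exact while every entry of the true product fits in a float64 integer):

import numpy as np

def int_dot(a, b):
    # Hypothetical helper: route an integer matrix product through the fast
    # float64 BLAS path and cast back. Only exact while every entry of the
    # true product fits in a float64 integer (absolute value <= 2**53).
    return np.rint(np.dot(a.astype(np.float64), b.astype(np.float64))).astype(a.dtype)
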
Luke
  • It's worth noting that this workaround may lead to errors if your data has very large values, and it will probably require copying the full matrix, so it is expensive if the matrix is very large. – Danica Aug 08 '12 at 04:26
  • Thanks, Luke. That workaround does copy the matrices and turns out to be quite a nuisance (for memory concerns), but as far as time goes, it is thousands of times faster for 1e4x1e4 matrices! Any larger and it is too slow to test the multiplication using ints. @Dougal This would only be true for numbers larger than 2^52 using a 64 bit float, right? The numbers will not be larger than that and I would like to take advantage of this speedup if possible. – Nino Aug 08 '12 at 21:11
  • @Nino Yep, around there. It's really too bad that BLAS libraries don't work for integer types, and that numpy's built-in `dot` is so much slower. If memory issues are a problem, you could consider writing a little ctypes interface that does the multiplication in [Eigen](http://eigen.tuxfamily.org/) or similar, which should be faster. – Danica Aug 08 '12 at 21:26
  • The fundamental matrix-matrix multiply in BLAS comes in 4 versions: sgemm.f, dgemm.f, cgemm.f, zgemm.f for single, double, complex, and double complex respectively. Linear algebra, which includes matrix inverses and solutions to linear equations, has little use for pure integer calculations. http://www.netlib.org/blas/ – hpaulj Aug 25 '13 at 23:10
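
(For the curious: assuming SciPy is installed, the double-precision routine mentioned above can be called directly through scipy.linalg.blas, which makes the s/d/c/z naming visible from Python. This is purely an illustration, not a replacement for np.dot:)

import numpy as np
from scipy.linalg.blas import dgemm  # 'd' = double precision; sgemm/cgemm/zgemm are the other variants

a = np.random.randn(3, 3)
b = np.random.randn(3, 3)
c = dgemm(1.0, a, b)                 # computes alpha * a.dot(b) with alpha = 1.0
print(np.allclose(c, np.dot(a, b)))  # True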