
Consider the following piece of code, which generates a (potentially) huge multi-dimensional array and performs numpy.tensordot with it (whether we multiply the same array or two different arrays here does not really matter).

import time
import numpy

L, N = 6, 4

shape = (2*L)*[N,]
A = numpy.arange(numpy.prod(shape)).reshape(shape)
A = A % 256 - 128   # values now in [-128, 127]
axes = (range(1, 2*L, 2), range(0, 2*L, 2))

def run(dtype, repeat=1):
    A_ = A.astype(dtype)
    t = time.time()
    for i in range(repeat):
        numpy.tensordot(A_, A_, axes)
    t = time.time() - t
    print(dtype, '   \t%8.2f sec\t%8.2f MB' %(t, A_.nbytes/1e6))

Now we can compare the performance for different data types, e.g.:

run(numpy.float64)
run(numpy.int64)

Since the array only consists of small integer numbers, I would like to save some memory by using dtype=int8. However, this slows down the matrix multiplication A LOT.

Here are some test cases.

The first one is the important one for my use case; the others are just for reference. Using Numpy 1.13.1 and Python 3.4.2.

Large array

L, N = 6, 4; A.size = 4**12 = 16777216
<class 'numpy.float64'>        59.58 sec      134.22 MB
<class 'numpy.float32'>        44.19 sec       67.11 MB
<class 'numpy.int16'>         711.16 sec       33.55 MB
<class 'numpy.int8'>          647.40 sec       16.78 MB

Same array with different data types. Memory decreases as expected. But why the large differences in CPU time? If anything, I would expect int to be faster than float.

Large array with different shape

L, N = 1, 4**6; A.size = (4**6)**2 = 16777216
<class 'numpy.float64'>        57.95 sec      134.22 MB
<class 'numpy.float32'>        42.84 sec       67.11 MB

The shape doesn't seem to have a large effect.

Not so large array

L, N = 5, 4
<class 'numpy.float128'>       10.91 sec       16.78 MB
<class 'numpy.float64'>         0.98 sec        8.39 MB
<class 'numpy.float32'>         0.90 sec        4.19 MB
<class 'numpy.float16'>         9.80 sec        2.10 MB
<class 'numpy.int64'>           8.84 sec        8.39 MB
<class 'numpy.int32'>           5.55 sec        4.19 MB
<class 'numpy.int16'>           2.23 sec        2.10 MB
<class 'numpy.int8'>            1.82 sec        1.05 MB

Smaller values, but same weird trend.

Small array, lots of repetitions

L, N = 2, 4; A.size = 4**4 = 256; repeat=1000000

<class 'numpy.float128'>       17.92 sec        4.10 KB
<class 'numpy.float64'>        14.20 sec        2.05 KB
<class 'numpy.float32'>        12.21 sec        1.02 KB
<class 'numpy.float16'>        41.72 sec        0.51 KB
<class 'numpy.int64'>          14.21 sec        2.05 KB
<class 'numpy.int32'>          14.26 sec        1.02 KB
<class 'numpy.int16'>          13.88 sec        0.51 KB
<class 'numpy.int8'>           13.03 sec        0.26 KB

Other than float16 being much slower, everything is fine here.

Question

Why is int8 for very large arrays so much slower? Is there any way around this? Saving memory becomes increasingly important for larger arrays!

Feodoran
  • Related - https://stackoverflow.com/questions/45373679 – Divakar Aug 03 '17 at 08:55
  • "BLAS, optimized over decades for different archs, cpus, instructions and cache-sizes has no integer-type!" Well, that explains why (although it opens the question of why there is no integer type in BLAS). But is there a better way in numpy? – Feodoran Aug 03 '17 at 09:08
  • I'd vote for using floats, even if it means using excessive amounts of memory. A factor of 2 or 4 doesn't sound so harsh as to be impractical. – Alfe Aug 03 '17 at 09:45
  • `Numpy` is built on `LAPACK` (generally), which is built on `BLAS`. Expecting `numpy` to improve the efficiency of `BLAS` is sort of like expecting an auto mechanic to improve your engine by inventing a lighter type of steel. – Daniel F Aug 03 '17 at 10:02
  • @Alfe If you approach the memory limit of your machine, then a factor of 2 or 4 makes a large difference. – Feodoran Aug 03 '17 at 10:23
  • @DanielF Makes sense. But maybe there is some approach other than Numpy? Could things like Cython, weave or F2PY offer better performance here? – Feodoran Aug 03 '17 at 10:23
  • Everybody builds engines out of steel - in this case, that steel is `BLAS`. Someone could make a better "steel" someday, but it probably won't be someone who makes "engines" for a living. – Daniel F Aug 03 '17 at 10:32
  • The thing is, what I need is not steel, but something slightly different. Having it built by someone who makes engines might still be better than a substitute (not optimized for my purpose) built by someone who makes steel. – Feodoran Aug 03 '17 at 11:34
  • Depending on your context it might be a good solution to implement your task in C(++) with Python bindings. On the C side, multiplying and summing up arrays of small ints should not be a problem and should be as fast as doing it with floats via BLAS. – Alfe Aug 03 '17 at 11:59
  • `tensordot` reshapes and swaps and then sends the job to `np.dot`. https://stackoverflow.com/questions/44103815/why-does-numpy-einsum-work-faster-with-float32-than-float16-or-uint16 compares these speeds in `einsum`. Intel on half-precision floats: https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats – hpaulj Aug 03 '17 at 15:47
  • Have a look at the accepted answer to my question that @Divakar linked here. I've gone down this path. All I can tell you is that, even if you try to use intrinsics and call avx2 operations, you'd not get as good performance as you'd get with float ops. – NULL Aug 04 '17 at 22:22
  • do you know ahead of time what kind of structure your tensor is going to have? Maybe you can do some strength reduction / exploit symmetry? I don't know if the numpy functions do any of these things – evamicur Mar 01 '18 at 18:22
  • Possible duplicate of [Why is it faster to perform float by float matrix multiplication compared to int by int?](https://stackoverflow.com/questions/45373679/why-is-it-faster-to-perform-float-by-float-matrix-multiplication-compared-to-int) – zvone Mar 04 '18 at 16:36
  • mmm, this problem is ripe for SPREADSHEETS! – SIGSTACKFAULT Mar 08 '18 at 19:27
  • Break it up into chunks to save memory while using the larger size to improve speed? – endolith Oct 05 '22 at 16:26

1 Answer


Unfortunately, as correctly underlined in the comments, the "engine" behind the scenes is BLAS, which has no native integer type. That is why float64 or float32 will run faster (there is some discussion of this in an answer to a similar question for C++, linked in the comments above).
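
For illustration, here is a minimal numpy-only sketch of the float-cast workaround discussed in the comments, using the setup from the question: keep the array stored as int8 and hand BLAS a temporary float32 copy only for the duration of the contraction, so the factor-4 memory cost is paid only while the product is being computed.

import numpy

L, N = 6, 4
shape = (2*L)*[N,]
A = (numpy.arange(numpy.prod(shape)).reshape(shape) % 256 - 128).astype(numpy.int8)
axes = (range(1, 2*L, 2), range(0, 2*L, 2))

A32 = A.astype(numpy.float32)             # temporary copy for BLAS (4x the int8 size)
result = numpy.tensordot(A32, A32, axes)
del A32                                   # release the float copy once the contraction is done

# Caveat: float32 represents integers exactly only up to 2**24; use float64
# (at 8x the int8 size) if the accumulated sums must stay exact beyond that.

Peak memory is then the stored int8 array plus one float32 copy plus the result, instead of a full-time float64 array.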

As a side note to the core of your question, one route to explore for speeding up your problem while limiting memory consumption is Cython, where you can run C code directly and get the result back in Python.
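
As a rough illustration (not from the original answer), here is what such a Cython kernel might look like for the 2-D matrix product that tensordot reduces to internally (as noted in the comments). The module and function names are hypothetical; the int8 inputs are accumulated directly into 64-bit integers, so no float copy is ever made. A plain triple loop like this will not match a tuned BLAS, but it keeps memory at the int8 footprint.

# int8dot.pyx -- hypothetical file name; compile with cythonize before importing
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def int8_matmul(signed char[:, :] a, signed char[:, :] b):
    # C = a @ b with int8 inputs, accumulating into 64-bit ints to avoid overflow
    cdef Py_ssize_t n = a.shape[0], k = a.shape[1], m = b.shape[1]
    out = np.zeros((n, m), dtype=np.longlong)
    cdef long long[:, :] c = out
    cdef Py_ssize_t i, j, p
    cdef long long acc
    for i in range(n):
        for j in range(m):
            acc = 0
            for p in range(k):
                acc += a[i, p] * b[p, j]
            c[i, j] = acc
    return out

On the Python side one would mimic what tensordot does itself: transpose and reshape the int8 array into two 2-D operands, call int8_matmul, and reshape the result back.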

Filippo Mazza