
I have Ubuntu 14.04 with an "Anaconda" Python distribution with Intel's math kernel library (MKL) installed. My processor is an Intel Xeon with 8 cores and without Hyperthreading (so only 8 threads).

For me, numpy's tensordot consistently outperforms einsum for large arrays. However, others have found very little difference between the two, or even that einsum can outperform numpy for some operations.

For people whose numpy distribution is built against a fast library, I am wondering why this might happen. Does MKL run more slowly on non-Intel processors? Or does einsum run faster on more modern Intel processors with better threading capabilities?

Here is some quick example code to compare performance on my machine:

In  [27]: a = rand(100,1000,2000)

In  [28]: b = rand(50,1000,2000)

In  [29]: time cten = tensordot(a, b, axes=[(1,2),(1,2)])
CPU times: user 7.85 s, sys: 29.4 ms, total: 7.88 s
Wall time: 1.08 s

In  [30]: "FLOPS TENSORDOT: {}.".format(cten.size * 1000 * 2000 / 1.08)
Out [30]: 'FLOPS TENSORDOT: 9259259259.26.'

In  [31]: time cein = einsum('ijk,ljk->il', a, b)
CPU times: user 42.3 s, sys: 7.58 ms, total: 42.3 s
Wall time: 42.4 s

In  [32]: "FLOPS EINSUM: {}.".format(cein.size * 1000 * 2000 / 42.4)
Out [32]: 'FLOPS EINSUM: 235849056.604.'

Tensor operations with tensordot consistently run in the 5-20 GFLOPS range, while I only get about 0.2 GFLOPS with einsum.
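
For completeness, here is a self-contained version of the same benchmark. This is just a sketch using a plain `import numpy as np` instead of the interactive namespace above, with the same FLOP estimate as in the session; actual numbers will of course vary by machine.

import numpy as np
import time

a = np.random.rand(100, 1000, 2000)
b = np.random.rand(50, 1000, 2000)

# contract axes 1 and 2 of both arrays with tensordot
t0 = time.time()
cten = np.tensordot(a, b, axes=[(1, 2), (1, 2)])
t_ten = time.time() - t0

# the same contraction written as an einsum index expression
t0 = time.time()
cein = np.einsum('ijk,ljk->il', a, b)
t_ein = time.time() - t0

# FLOP estimate as above: output size times the length of the contracted axes
flops = cten.size * 1000 * 2000
print("FLOPS TENSORDOT: {:.3g}".format(flops / t_ten))
print("FLOPS EINSUM:    {:.3g}".format(flops / t_ein))
print("results agree:", np.allclose(cten, cein))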

Will Martin
    "einsum may outperform numpy for some operations." That comparison is not valid. einsum is a very generic function, that can be used for a number of operations. You can't compare, say, simply copying an array with einsum with the complexity of a tensor product. – rth Jul 02 '15 at 17:33
  • True. I was interested in comparing the efficiency of `einsum` for tensor products that could also be done using `tensordot`. The linked comparison does include some other operations, which are a bit off topic here. Thanks! – Will Martin Jul 06 '15 at 17:09

1 Answer


Essentially you are comparing two very different things:

  • np.einsum calculates the tensor product with for loops in C. It has some SIMD optimizations but is not multi-threaded and does not use MKL.

  • np.tensordot consists of reshaping/broadcasting the input arrays and then calling BLAS (MKL, OpenBLAS, etc.) for the matrix multiplication. The reshaping/broadcasting step adds some overhead, but the matrix multiplication itself is extremely well optimized with SIMD, some hand-written assembly, and multi-threading (a sketch of this reshape + GEMM equivalence follows below).
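
To make the second point concrete, here is a rough sketch (not the actual implementation, just the idea) of what np.tensordot does for the contraction in the question: the contracted axes are flattened and the whole operation becomes a single matrix product, which is where BLAS and its threading come in.

import numpy as np

a = np.random.rand(10, 30, 40)   # same pattern as the question, smaller shapes
b = np.random.rand(5, 30, 40)

# tensordot contracting axes (1, 2) of both arrays ...
cten = np.tensordot(a, b, axes=[(1, 2), (1, 2)])

# ... is equivalent to flattening the contracted axes and calling one GEMM
a2 = a.reshape(a.shape[0], -1)   # shape (10, 1200)
b2 = b.reshape(b.shape[0], -1)   # shape (5, 1200)
cdot = np.dot(a2, b2.T)          # single BLAS matrix multiplication

print(np.allclose(cten, cdot))   # True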

As a result, tensordot will generally be faster than einsum even in single-core execution, unless the arrays are small (in which case the reshaping/broadcasting overhead becomes non-negligible). The gap only widens with multiple cores, since the former is multi-threaded while the latter is not.
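
If you want to check where the crossover sits on your own machine, a quick sketch along these lines will do (the shapes and repeat counts are arbitrary; the break-even point depends on the CPU and on which BLAS numpy is linked against):

import numpy as np
import timeit

for n in (10, 100, 500):
    a = np.random.rand(20, n, n)
    b = np.random.rand(20, n, n)
    # time the same contraction both ways, a few repeats each
    t_ten = timeit.timeit(lambda: np.tensordot(a, b, axes=[(1, 2), (1, 2)]), number=3)
    t_ein = timeit.timeit(lambda: np.einsum('ijk,ljk->il', a, b), number=3)
    print("n={}: tensordot {:.4f} s, einsum {:.4f} s".format(n, t_ten, t_ein))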

In conclusion, the results you are getting are perfectly normal and will probably hold quite generally (Intel or non-Intel CPU, recent or not, multi-core or not, MKL or OpenBLAS, etc.).

rth