For the topic np.einsum, I have already read a bunch of discussions at:
In order to understand more about why np.einsum is faster than the usual np.sum, np.product, etc. (even for the latest numpy version in Anaconda), I'm using np.einsum_path to see what the optimization process actually optimizes.
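For reference, as far as I understand, np.einsum_path returns a 2-tuple: element [0] is the chosen contraction path and element [1] is a printable report with the FLOP counts and theoretical speedup. A tiny sketch with an arbitrary shape:
import numpy as np

a = np.empty((64, 64))
path, report = np.einsum_path('ij,jk->ik', a, a, optimize='greedy')
print(path)    # the contraction order, e.g. ['einsum_path', (0, 1)]
print(report)  # printable breakdown: FLOP counts, theoretical speedup, per-step path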
When doing so, I found an interesting phenomenon. Consider this minimal example:
import numpy as np
for i in 'int8 int16 int32 int64 uint8 uint16 uint32 uint64 float32 float64'.split():
    print(np.einsum_path('i->', np.empty(2**30, i))[1])
The outputs are all identical:
Complete contraction: i->
Naive scaling: 1
Optimized scaling: 1
Naive FLOP count: 1.074e+09
Optimized FLOP count: 2.147e+09
Theoretical speedup: 0.500
Largest intermediate: 1.000e+00 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
1 i-> ->
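As far as I can tell, the reported theoretical speedup is simply the ratio of the two FLOP counts; plugging in the numbers from the report above:
naive_flops = 1.074e9       # "Naive FLOP count" from the report
optimized_flops = 2.147e9   # "Optimized FLOP count" from the report
print(naive_flops / optimized_flops)   # ~0.5, matching "Theoretical speedup: 0.500"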
Here the Optimized FLOP count increases (which should mean more computation?) and the Theoretical speedup is smaller than 1 (which should mean it's slower). But if we actually time the computation:
for i in 'int8 int16 int32 int64 uint8 uint16 uint32 uint64 float32 float64'.split():
    a = np.empty(2**27, i)
    raw = %timeit -qon9 a.sum()                                # plain ndarray.sum()
    noOp = %timeit -qon9 np.einsum('i->', a, optimize=False)   # einsum without path optimization
    op = %timeit -qon9 np.einsum('i->', a, optimize='greedy')  # einsum with the greedy optimizer
    print(i, raw.average/op.average, noOp.average/op.average, sep='\t')
If we look at the last column, i.e. the non-optimized (optimize=False) einsum timing divided by the optimized timing, the ratios are all close to 1, which means the optimization didn't make it slower:
int8 4.485133392283354 1.0205873691331475
int16 3.7817373109729213 0.9528030137222752
int32 1.3760725925789292 1.0741615462167338
int64 1.0793509548186524 1.0076602576129605
uint8 4.509893894635594 0.997277624256872
uint16 3.964949791428885 0.9914991211913878
uint32 1.3054813163356085 1.009475242303559
uint64 1.0747670688044795 1.0082522386805526
float32 2.4105510701565636 0.9998241152368149
float64 2.1957241421227556 0.9836838487664662
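For anyone without IPython, here is a plain-timeit sketch of the same benchmark (absolute numbers will differ per machine, but the ratios should be comparable):
import timeit
import numpy as np

for dt in 'int8 int16 int32 int64 uint8 uint16 uint32 uint64 float32 float64'.split():
    a = np.empty(2**27, dt)
    raw = timeit.timeit(lambda: a.sum(), number=9) / 9
    noOp = timeit.timeit(lambda: np.einsum('i->', a, optimize=False), number=9) / 9
    op = timeit.timeit(lambda: np.einsum('i->', a, optimize='greedy'), number=9) / 9
    print(dt, raw / op, noOp / op, sep='\t')   # columns: dtype, sum/opt ratio, no-opt/opt ratio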
I'm wondering why np.einsum_path says it takes more FLOPs and would be slower. I believe the runtime is directly determined by the number of FLOPs, so these two benchmarks should basically be measuring the same thing.
BTW, I'm attaching an example showing how np.einsum_path "usually" behaves, to show that the result above is abnormal:
a = np.empty((64, 64))
print(np.einsum_path('ij,jk,kl->il', a, a, a)[1])
noOp = %timeit -qon99 np.einsum('ij,jk,kl->il', a, a, a, optimize=False)
op = %timeit -qon99 np.einsum('ij,jk,kl->il', a, a, a, optimize='greedy')
print('Actual speedup:', noOp.average / op.average)
which outputs:
Complete contraction: ij,jk,kl->il
Naive scaling: 4
Optimized scaling: 3
Naive FLOP count: 5.033e+07
Optimized FLOP count: 1.049e+06
Theoretical speedup: 48.000
Largest intermediate: 4.096e+03 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
3 jk,ij->ik kl,ik->il
3 ik,kl->il il->il
Actual speedup: 90.33518444642904
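And the same FLOP-ratio check holds for this example, while the measured speedup is even larger:
naive_flops = 5.033e7       # "Naive FLOP count" above
optimized_flops = 1.049e6   # "Optimized FLOP count" above
print(naive_flops / optimized_flops)   # ~48, matching "Theoretical speedup: 48.000"; measured speedup is ~90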