
For the topic of np.einsum, I have already read a number of discussions at:

In order to understand more about why np.einsum is faster than the usual np.sum, np.prod, etc. (even with the latest numpy version in Anaconda), I'm using np.einsum_path to see what the optimization process actually optimizes.

When doing so, I found an interesting phenomenon. Consider this minimal example:

import numpy as np
for i in 'int8 int16 int32 int64 uint8 uint16 uint32 uint64 float32 float64'.split():
    print(np.einsum_path('i->', np.empty(2**30, i))[1])

The outputs are all identical:

  Complete contraction:  i->
         Naive scaling:  1
     Optimized scaling:  1
      Naive FLOP count:  1.074e+09
  Optimized FLOP count:  2.147e+09
   Theoretical speedup:  0.500
  Largest intermediate:  1.000e+00 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   1                         i->                                       ->

Here the Optimized FLOP count actually increases (which would mean more computation?) and the Theoretical speedup is smaller than 1 (which would mean it is slower). But if we actually time the computation:

for i in 'int8 int16 int32 int64 uint8 uint16 uint32 uint64 float32 float64'.split():
    a    = np.empty(2**27, i)
    raw  = %timeit -qon9 a.sum()
    noOp = %timeit -qon9 np.einsum('i->', a, optimize=False)
    op   = %timeit -qon9 np.einsum('i->', a, optimize='greedy')
    print(i, raw.average/op.average, noOp.average/op.average, sep='\t')

If we look at the last column, the un-optimized ("naive") einsum timing divided by the optimized timing, the values are all close to 1, which means the optimization did not make it slower (but did not speed anything up either):

int8    4.485133392283354   1.0205873691331475
int16   3.7817373109729213  0.9528030137222752
int32   1.3760725925789292  1.0741615462167338
int64   1.0793509548186524  1.0076602576129605
uint8   4.509893894635594   0.997277624256872
uint16  3.964949791428885   0.9914991211913878
uint32  1.3054813163356085  1.009475242303559
uint64  1.0747670688044795  1.0082522386805526
float32 2.4105510701565636  0.9998241152368149
float64 2.1957241421227556  0.9836838487664662

I'm wondering why np.einsum_path says it takes more FLOPs and is slower. I would expect the running time to be roughly proportional to the number of FLOPs, so these two benchmarks should essentially be measuring the same thing.

BTW, I'm attaching an example showing how np.einsum_path "usually" behaves, to show that the result above is abnormal:

a = np.empty((64, 64))
print(np.einsum_path('ij,jk,kl->il', a, a, a)[1])
noOp = %timeit -qon99 np.einsum('ij,jk,kl->il', a, a, a, optimize=False)
op   = %timeit -qon99 np.einsum('ij,jk,kl->il', a, a, a, optimize='greedy')
print('Actual speedup:', noOp.average / op.average)

which outputs:

  Complete contraction:  ij,jk,kl->il
         Naive scaling:  4
     Optimized scaling:  3
      Naive FLOP count:  5.033e+07
  Optimized FLOP count:  1.049e+06
   Theoretical speedup:  48.000
  Largest intermediate:  4.096e+03 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   3                   jk,ij->ik                                kl,ik->il
   3                   ik,kl->il                                   il->il
Actual speedup: 90.33518444642904
    My understanding is that `np.einsum_path` seeks to determine the order in which a complex inner product is taken - one where there are multiple rectangular arrays. It's looking at the "creation of intermediate arrays". In `i->` there aren't intermediate arrays; – hpaulj Nov 07 '18 at 03:58
  • Look at `np.core.einsumfunc._flop_count`. There's no `dtype` argument. It's just looking at the dimensions of the arrays. My guess is that `einsum_path` ignores the `dtype` when evaluating the path. In effect it assumes all inputs have the same dtype, so this isn't a factor in evaluating the optimal order of a chain of `dot` products. – hpaulj Nov 07 '18 at 04:03
    Your first link is from 2013. I believe that was before `einsum` implemented this `path` evaluation. Originally all it used was a compiled form of `nditer`, constructing one large iteration space. Since then there have been lots of attempts to optimize it, not just this path business. In some cases it actually uses `matmul` or `dot` rather than its own computation. – hpaulj Nov 07 '18 at 04:09
  • @hpaulj I guess since it's complicated, I have to read the source code in order to understand what exactly did it do. Actually I don't quite understand the documentation of `np.core.einsumfunc._flop_count` (or its examples). – ZisIsNotZis Nov 07 '18 at 04:21
  • It looks like the optimized cost finding is not accounting for this corner case and views `i->` as an inner product. Which it is in its rather simplistic view of the world, but there is only one flop per value (add) rather than two per value (multiply and add) for general inner products. The `idx_removed` [here](https://github.com/numpy/numpy/blob/v1.15.4/numpy/core/einsumfunc.py#L916) should read something like `inner = idx_removed and (len(input_sets) > 1))`. – Daniel Dec 31 '18 at 19:58

1 Answer


I just dug into the source code of np.einsum_path. According to this comment in the source:

# Compute naive cost
# This isn't quite right, need to look into exactly how einsum does this

and from the way the optimized cost is computed (not posting it here, it is too long), it seems the two costs are computed inconsistently, while the first one is even documented as being "not quite right".
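
To make the mismatch concrete, here is a rough paraphrase of the cost logic as I read it (my own sketch, not the actual numpy code), reproducing the two numbers reported for 'i->' on a 2**30-element array:

# Rough paraphrase of the two cost estimates -- NOT the actual numpy source.
n = 2**30  # number of elements summed by 'i->'

# Naive cost: the "is this an inner product?" test used for the naive estimate
# apparently comes out False for 'i->' (no index is shared between two
# operands), so it charges 1 op per element.
naive_flops = 1 * n                   # 1.074e+09, matches "Naive FLOP count"

# Per-contraction cost: any contraction that removes an index is charged as an
# inner product (a multiply plus an add per element), even though 'i->' is
# just a plain sum, so it charges 2 ops per element.
optimized_flops = 2 * n               # 2.147e+09, matches "Optimized FLOP count"

# "Theoretical speedup" is simply the ratio of the two estimates.
print(naive_flops / optimized_flops)  # 0.5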

I then printed the einsum_path report with optimization turned off:

import numpy as np
print(np.einsum_path('i->', np.empty(2**30, 'b'), optimize=False)[1])

which, surprisingly, produces exactly the same report: the "Optimized FLOP count" still differs from the naive one even though no optimization was requested:

  Complete contraction:  i->
         Naive scaling:  1
     Optimized scaling:  1
      Naive FLOP count:  1.074e+09
  Optimized FLOP count:  2.147e+09
   Theoretical speedup:  0.500
  Largest intermediate:  1.000e+00 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   1                         i->                                       ->

Therefore, the reported slow-down is simply a FLOP-counting bug: np.einsum does not actually do any path optimization here (while it is still fundamentally faster than a plain np.sum for whatever reason).
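
As a quick sanity check (my own addition), the path actually returned by np.einsum_path for this contraction is the trivial single-step one, so optimize='greedy' has nothing to reorder:

import numpy as np

a = np.empty(2**20, 'int8')
# The first element of the return value is the path that will actually be
# taken; for a single-operand contraction it can only be the trivial one-step
# path, so the optimizer cannot change anything here.
path, report = np.einsum_path('i->', a, optimize='greedy')
print(path)   # ['einsum_path', (0,)]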
