0

Consider an example:

import numpy as np
x = np.ones((100, 100))
y = np.ones((100, 100))
z = np.ones((100, 100))
f = x * y + y * z
  • When evaluating f, how does numpy proceed with the execution and manage memory for storing the intermediate results?
  • Are optimizations applied to convert f to f = (x + z) * y?
  • If we instead evaluate f = (x + z) * y, does numpy allocate any temporary memory besides the memory for f?

Or consider another example:

f = a + b + c # all of the same dimension
  • Does numpy allocate intermediate memory when evaluating this expression?

[edit] Following @Daniel's answer, a similar benchmark shows that numpy allocates memory twice for this expression.

It would be helpful if you could point me to some docs or related compiler techniques. Thanks!

Yixing Lao
  • 2
    A simple timeit test shows `(x+z)*y` to be 30% faster - so I'd say no to that kind of optimization. – hpaulj Aug 12 '15 at 17:52
  • 2
    `x * y + y * z` is not necessarily equal to `(x + z) * y` when you're dealing with floating point numbers. Not even C compilers do this kind of optimization. See for example another question from SO: [Why doesn't GCC optimize `a*a*a*a*a*a` to `(a*a*a)*(a*a*a)`?](http://stackoverflow.com/q/6430448/12892) – Cristian Ciupitu Aug 12 '15 at 21:46
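Cristian Ciupitu's point can be checked numerically; a minimal sketch (the random data and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**5)
y = rng.standard_normal(10**5)
z = rng.standard_normal(10**5)

a = x * y + y * z  # rounds each product, then the sum
b = (x + z) * y    # rounds the sum, then the product

# The results agree to within floating-point tolerance,
# but bitwise equality generally fails.
print(np.allclose(a, b), np.array_equal(a, b))
```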

2 Answers

1

It seems numpy doesn't do any optimization but, perhaps more surprisingly, neither does Theano.

Here's a script to compare the two variations for each implementation. Output below.

import timeit
import numpy
import theano
import theano.tensor as tt


def main():
    size = 1000
    iterations = 1000

    a = tt.matrix()
    b = tt.matrix()
    c = tt.matrix()
    f1 = theano.function(inputs=[a, b, c], outputs=a * b + b * c)
    f2 = theano.function(inputs=[a, b, c], outputs=(a + c) * b)
    theano.printing.debugprint(f1)
    theano.printing.debugprint(f2)

    x = numpy.ones((size, size))
    y = numpy.ones((size, size))
    z = numpy.ones((size, size))

    result = x * y + y * z

    start = timeit.default_timer()
    for _ in xrange(iterations):
        result1 = x * y + y * z
        assert numpy.all(result1 == result)
    print timeit.default_timer() - start
    start = timeit.default_timer()
    for _ in xrange(iterations):
        result2 = (x + z) * y
        assert numpy.all(result2 == result)
    print timeit.default_timer() - start
    start = timeit.default_timer()
    for _ in xrange(iterations):
        result3 = f1(x, y, z)
        assert numpy.all(result3 == result)
    print timeit.default_timer() - start
    start = timeit.default_timer()
    for _ in xrange(iterations):
        result4 = f2(x, y, z)
        assert numpy.all(result4 == result)
    print timeit.default_timer() - start


main()

I get the following output:

Elemwise{Composite{((i0 * i1) + (i1 * i2))}} [@A] ''   0
 |<TensorType(float64, matrix)> [@B]
 |<TensorType(float64, matrix)> [@C]
 |<TensorType(float64, matrix)> [@D]
Elemwise{Composite{((i0 + i1) * i2)}} [@A] ''   0
 |<TensorType(float64, matrix)> [@B]
 |<TensorType(float64, matrix)> [@C]
 |<TensorType(float64, matrix)> [@D]
9.19932313948
6.43367212255
4.15276831469
4.07725744595

So the manually optimized version is faster for both numpy and Theano, but the difference is smaller with Theano. The Theano computation graphs printed above show that Theano's optimizing compiler has not automatically rewritten the expression. Theano's versions are faster overall, though.

Note that these results can differ depending on the size of the matrices being operated on.

Daniel Renshaw
  • That's interesting, thanks. I didn't expect that Theano lacks this. Optimization is one part; the real question is how intermediate buffers are handled during evaluation. – Yixing Lao Aug 12 '15 at 21:15
1

No, numpy does not do any such optimization.

How could it? Numpy implements n-dimensional array objects in a C extension; the rest is pure Python. numpy never sees the actual expression you are trying to evaluate: it performs operations one at a time, in the order given by Python's evaluation rules (see the Python documentation). Therefore, for each intermediate result, numpy has to allocate temporary memory. Hence:

  • f = (x*y) + (y*z) (parentheses added for clarity) takes three memory allocations, the last one being used for f. No expression rewriting is done (that would even be dangerous, as it could change the rounding behaviour).
  • f = (x + z) * y: two allocations, the last again bound to f.
  • f = (x + y) + z: also two allocations.

It would of course not be completely impossible for numpy to learn what expression you are evaluating, but without help from the interpreter, any such approach would be a dirty trick, bound to confuse users. numpy does none of that.
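One way to make the allocations explicit is the ufunc `out=` parameter (see also hpaulj's comment below); a minimal sketch with illustrative values:

```python
import numpy as np

x = np.ones((100, 100))
y = np.ones((100, 100))
z = np.ones((100, 100))

f_naive = x * y + y * z  # two temporaries plus the result

f = np.empty_like(x)      # the only allocation for this variant
np.add(x, z, out=f)       # f = x + z, written in place
np.multiply(f, y, out=f)  # f = (x + z) * y, written in place

assert np.allclose(f, f_naive)
```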

burnpanck
  • For your last example, `f = (x + y) + z`, it is possible to allocate only once, right? @burnpanck – Yixing Lao Aug 12 '15 at 21:57
  • 1
    `numpy` cannot forgo the second allocation, as it does not know whether the temporary it created will be needed later. To `numpy`, it looks exactly like `t = x+y; f=t+z`; you could need `t` later on. But you can rearrange the expression manually to save one allocation: `f=x+y; f+=z`. – burnpanck Aug 12 '15 at 22:01
  • That's right; I've just done a benchmark to confirm it. – Yixing Lao Aug 12 '15 at 22:07
  • 2
    The `out` parameter of `ufunc` gives you control over intermediates. e.g. `f=np.empty_like(x); np.add(x,z,out=f); np.multiply(y,f,out=f)` – hpaulj Aug 12 '15 at 23:55
  • @hpaulj; good point! However for standard operations like `np.add`, the in-place operations are equivalent. You cannot get around the allocation for `f`. Of course `np.empty` and `np.empty_like` are the most efficient there, but I'm sure that the automatic temporaries are allocated in the same way (i.e. no unnecessary zeroing before the operation). – burnpanck Aug 14 '15 at 11:42
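The manual rearrangement mentioned in the comments (`f = x + y; f += z`) can be sketched as follows; the array sizes are illustrative:

```python
import numpy as np

x = np.ones((1000, 1000))
y = np.ones((1000, 1000))
z = np.ones((1000, 1000))

f = x + y  # one allocation, bound to f
f += z     # in-place add into f, no further allocation

assert np.allclose(f, x + y + z)
```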