
While implementing a Gauss-Seidel solver in Python with NumPy, I discovered an interesting side effect. I tried to extract a minimal example:


    number = 1000
    startup = 'import numpy as np;N = 2048;X = np.arange(N * N).reshape((N, N));'
    startup2 = startup + 'np.empty_like(X)'


    example = ('X[1:: 2, : -1: 2] = ('
               'X[: -1: 2, 1:: 2] +'
               'X[1:: 2, : -1: 2] +'
               'X[1:: 2, : -1: 2] +'
               'X[: -1: 2, : -1: 2])/4')

Running `print(timeit.timeit(example, setup=startup, number=number))` takes about 5 s on my machine,

while `print(timeit.timeit(example, setup=startup2, number=number))` takes about 4 s.

So the second version is roughly 1 s faster, even though it contains the seemingly unnecessary array allocation with `np.empty_like(X)`. I observed this effect on various machines and with various array sizes and iteration counts.
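
For reference, the complete comparison script I use looks like this (a minimal sketch; the absolute timings of course vary from machine to machine):

    import timeit

    number = 1000
    startup = 'import numpy as np;N = 2048;X = np.arange(N * N).reshape((N, N));'
    startup2 = startup + 'np.empty_like(X)'

    example = ('X[1:: 2, : -1: 2] = ('
               'X[: -1: 2, 1:: 2] +'
               'X[1:: 2, : -1: 2] +'
               'X[1:: 2, : -1: 2] +'
               'X[: -1: 2, : -1: 2])/4')

    # without the unused allocation: ~5 s on my machine
    print(timeit.timeit(example, setup=startup, number=number))
    # with the unused np.empty_like(X) in the setup: ~4 s on my machine
    print(timeit.timeit(example, setup=startup2, number=number))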

I assume that evaluating the right-hand side of the assignment requires a temporary array allocation. It seems that NumPy somehow reuses the unused array that was created with `np.empty_like(X)` to speed up that temporary allocation.

Am I right with this assumption, or is something entirely different responsible for the difference in timing?

If I remove the `/4`, i.e. change the example to

    example = ('X[1:: 2, : -1: 2] = ('
               'X[: -1: 2, 1:: 2] +'
               'X[1:: 2, : -1: 2] +'
               'X[1:: 2, : -1: 2] +'
               'X[: -1: 2, : -1: 2])')

then I cannot observe any difference in execution time between the two versions. So I assume that in this case the calculation can be done in place and no temporary allocation is needed.

Is there a more explicit way to exploit this effect? Just writing `np.empty_like(X)` looks somewhat "hacky" to me.
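
The most explicit alternative I can think of is to preallocate a scratch buffer myself and let the ufuncs write into it via their `out=` argument, so that no temporaries have to be allocated inside the loop at all. This is only a sketch of the idea; I have not verified that it is as fast as the `np.empty_like(X)` trick:

    import numpy as np

    N = 2048
    number = 1000
    X = np.arange(N * N).reshape((N, N))

    # Scratch buffer with the shape of the updated block, allocated once and reused.
    buf = np.empty(X[1:: 2, : -1: 2].shape)

    for _ in range(number):
        # Accumulate the right-hand side into buf instead of letting NumPy
        # allocate a temporary for every intermediate sum.
        np.add(X[: -1: 2, 1:: 2], X[1:: 2, : -1: 2], out=buf)
        np.add(buf, X[1:: 2, : -1: 2], out=buf)
        np.add(buf, X[: -1: 2, : -1: 2], out=buf)
        buf /= 4
        X[1:: 2, : -1: 2] = buf  # cast back to the dtype of X on assignment, as before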

Thanks in advance!

Updates:

Thank you for your answers and comments so far.

I finally found some more time to experiment with my observations, and I am not sure whether my explanations above are clear enough. My initial observation was that this

    number = 1000
    N = 1024
    X = np.arange(N * N).reshape((N, N))
    np.empty_like(X)
    for _ in range(number):
        X[1:: 2, : -1: 2] = (X[: -1: 2, 1:: 2] + X[1:: 2, : -1: 2] +
                             X[1:: 2, : -1: 2] + X[: -1: 2, : -1: 2]) / 4

was faster than this

    number = 1000
    N = 1024
    X = np.arange(N * N).reshape((N, N))
    for _ in range(number):
        X[1:: 2, : -1: 2] = (X[: -1: 2, 1:: 2] + X[1:: 2, : -1: 2] +
                             X[1:: 2, : -1: 2] + X[: -1: 2, : -1: 2]) / 4

I found this quite surprising, because the totally unused and unnecessary array allocation with `np.empty_like(X)` seems to accelerate the loop below it. It does not matter whether I use `np.empty_like`, `np.zeros_like`, `np.ones_like`, `np.ones(X.shape)` or `X.copy()`, as long as an unused array is allocated. The effect also occurs for different values of `N`, different iteration counts and on different machines.
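
To check the different allocation routines, I use a small variation of the timing script from above (again just a sketch):

    import timeit

    number = 1000
    base = 'import numpy as np;N = 1024;X = np.arange(N * N).reshape((N, N));'
    example = ('X[1:: 2, : -1: 2] = (X[: -1: 2, 1:: 2] + X[1:: 2, : -1: 2] +'
               ' X[1:: 2, : -1: 2] + X[: -1: 2, : -1: 2]) / 4')

    # Time the update once without and once with each kind of unused allocation.
    for alloc in ('', 'np.empty_like(X)', 'np.zeros_like(X)', 'np.ones_like(X)',
                  'np.ones(X.shape)', 'X.copy()'):
        t = timeit.timeit(example, setup=base + alloc, number=number)
        print('%-20s %.2f s' % (alloc or '<nothing>', t))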

Furthermore, I tried to investigate the issue with the `tracemalloc` package:

    import tracemalloc

    import numpy as np

    number = 1000
    N = 1024
    X = np.arange(N * N).reshape((N, N))
    tracemalloc.start()
    np.empty_like(X)
    for _ in range(number):
        X[1:: 2, : -1: 2] = (X[: -1: 2, 1:: 2] + X[1:: 2, : -1: 2] +
                             X[1:: 2, : -1: 2] + X[: -1: 2, : -1: 2]) / 4
    s2 = tracemalloc.take_snapshot()

    display_top(s2)

Here `display_top` is the helper function from the tracemalloc documentation, except that I print the sizes in bytes instead of KiB; a sketch of the version I use follows below.
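
For reference, this is essentially the `display_top` helper from the tracemalloc documentation, only with the KiB formatting changed to raw byte counts:

    import linecache
    import tracemalloc

    def display_top(snapshot, key_type='lineno', limit=10):
        # Helper from the tracemalloc documentation, modified to print bytes instead of KiB.
        snapshot = snapshot.filter_traces((
            tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
            tracemalloc.Filter(False, "<unknown>"),
        ))
        top_stats = snapshot.statistics(key_type)

        print("Top %s lines" % limit)
        for index, stat in enumerate(top_stats[:limit], 1):
            frame = stat.traceback[0]
            print("#%s: %s:%s: %.1f B" % (index, frame.filename, frame.lineno, stat.size))
            line = linecache.getline(frame.filename, frame.lineno).strip()
            if line:
                print('    %s' % line)

        other = top_stats[limit:]
        if other:
            size = sum(stat.size for stat in other)
            print("%s other: %.1f B" % (len(other), size))
        total = sum(stat.size for stat in top_stats)
        print("Total allocated size: %.1f B" % total)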

When I run it without the extra array allocation from `np.empty_like(X)`, I get output like this:

    Top 10 lines
    #1: ./minexample.py:40: 160.0 B
        X[1:: 2, : -1: 2] = (X[: -1: 2, 1:: 2] + X[1:: 2, : -1: 2] + X[1:: 2, : -1: 2] + X[: -1: 2, : -1: 2]) / 4
    #2: ./minexample.py:39: 28.0 B
        for _ in range(1000):
    Total allocated size: 188.0 B

With the extra allocation I get this:

    Top 10 lines
    #1: ./minexample.py:40: 128.0 B
        X[1:: 2, : -1: 2] = (X[: -1: 2, 1:: 2] + X[1:: 2, : -1: 2] + X[1:: 2, : -1: 2] + X[: -1: 2, : -1: 2]) / 4
    #2: ./minexample.py:38: 32.0 B
        np.empty_like(X)
    #3: ./minexample.py:39: 28.0 B
        for _ in range(1000):
    Total allocated size: 188.0 B

So the size allocated by the line inside the loop is smaller than when no unused array is preallocated. It seems to me that the unused array is indeed being reused.

Maybe this explains the effect?

christoph
  • Very interesting question! I gave it some thoughts and I'm really interested in an explanation why this happens. Can you lift out the ``import`` from the first ``startup`` statement and see if that has any effect? – Marc Sances Oct 23 '20 at 18:37
  • Went along expected lines for me - `%timeit N = 2048;X = np.arange(N * N).reshape((N, N));` 6.95 ms ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)... `%timeit N = 2048;X = np.arange(N * N).reshape((N, N)); np.empty_like(X);` 6.97 ms ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each). OP's case with `import numpy as np` seemed to be the unwanted bad apple with timings. @MarcSances – Divakar Oct 24 '20 at 13:45
  • I can't reproduce the result consistently. Meanwhile, in most runs the difference between the two measurements is less than 5%. The order of two `timeit` statements also makes a difference. – GZ0 Oct 26 '20 at 09:49
  • `%timeit N=2048; X = np.arange(N * N).reshape((N, N));` and `%timeit X = np.arange(N * N).reshape((N, N)); X=None;` is also different in time (17.4 ms ± 226 µs vs. 12.2 ms ± 114 µs)… – Héliton Martins Oct 31 '20 at 04:38

1 Answer


`empty_like(X)` allocates an array with the same shape and dtype as `X` and leaves its contents uninitialized (whatever values happen to be in that memory, not zeros and not freshly generated random numbers), which is why it is an incredibly fast operation. But keep in mind that adding `empty_like(X)` to an expression does not give you a meaningful result (you would need `zeros_like(X)` for that)!

Why adding two arrays is fast in NumPy has to do with densely packed arrays and is described here: https://stackoverflow.com/a/8385658/14344821; this can potentially be faster than creating the matrix X with explicitly spelled-out entries. Tips on how to create NumPy arrays efficiently can be found here.
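
A quick way to compare the allocation routines yourself (just a sketch; the exact numbers depend on your machine and allocator):

    import timeit

    import numpy as np

    X = np.arange(2048 * 2048).reshape((2048, 2048))

    # empty_like only reserves memory; zeros_like additionally writes zeros into it.
    print(timeit.timeit(lambda: np.empty_like(X), number=1000))
    print(timeit.timeit(lambda: np.zeros_like(X), number=1000))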

Casper Dijkstra
  • "fills it with random values" No, it doesn't fill it with anything at all. Generating random values would be slow. It leaves the memory unchanged from when it was last used. "This function does not initialize the returned array" – endolith Mar 09 '22 at 21:57