
I've read other posts saying that Python speed/performance should be relatively unaffected by whether code is run at the top level, inside a function, or defined on a class, but those posts don't explain the very large performance differences I see between class and local variables, especially when using the numpy library. To make this concrete, I put together the script example below.

import numpy as np
import copy 

class Test:
    def __init__(self, n, m):
        self.X = np.random.rand(n,n,m)
        self.Y = np.random.rand(n,n,m)
        self.Z = np.random.rand(n,n,m)
    def matmul1(self):
        self.A = np.zeros(self.X.shape)
        for i in range(self.X.shape[2]):
            self.A[:,:,i] = self.X[:,:,i] @ self.Y[:,:,i] @ self.Z[:,:,i]
        return
    def matmul2(self):
        self.A = np.zeros(self.X.shape)
        for i in range(self.X.shape[2]):
            x = copy.deepcopy(self.X[:,:,i])
            y = copy.deepcopy(self.Y[:,:,i])
            z = copy.deepcopy(self.Z[:,:,i])
            self.A[:,:,i] = x @ y @ z
        return

t1 = Test(300,100)

%timeit t1.matmul1()
# OUTPUT: 20.9 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit t1.matmul2()
# OUTPUT: 516 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this script I define a class whose attributes X, Y and Z are 3-way arrays. I also define two methods (matmul1 and matmul2) that loop over the third index of the arrays and matrix-multiply the corresponding 2D slices to populate an array, A. matmul1 multiplies the class attributes directly inside the loop, whereas matmul2 makes local copies of each slice before every matrix multiplication. matmul1 is ~40x slower than matmul2. Can someone explain why this is happening? Maybe I am thinking about how to use classes incorrectly, but I also wouldn't assume that variables should be deep-copied all the time. Basically, what is it about deep copying that affects my performance so significantly, and is this unavoidable when using class attributes/variables? It seems like it's more than just the overhead of attribute access as discussed here. Any input is appreciated, thanks!

Edit: My real question is why copies of subarrays of class instance variables, rather than views of them, give much better performance in these types of methods.
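
For reference, here is a minimal check (an editorial sketch, not part of the original post) showing that the `[:,:,i]` slices are strided views of the instance arrays, while a plain `.copy()` yields fresh contiguous memory:

import numpy as np

X = np.random.rand(4, 4, 3)

view = X[:, :, 0]                      # slicing the last axis returns a view
print(np.shares_memory(view, X))       # True: no data was copied
print(view.flags['C_CONTIGUOUS'])      # False: the view strides over X's buffer

copied = X[:, :, 0].copy()             # a plain copy; deepcopy is not needed here
print(np.shares_memory(copied, X))     # False: the copy owns its own memory
print(copied.flags['C_CONTIGUOUS'])    # True: contiguous, BLAS-friendly layout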

  • If I interpret correctly, you are wondering why **the deep-copy version is much, much faster** (which the comment above misses?). That needs profiling and more analysis, but there are two obvious candidates. A: the copying in matmul2 lets numpy delegate to BLAS, and in matmul1 that is somehow failing. Maybe, maybe not. B: the indexed deep copy is an O(n^2) operation that changes the *storage order* before an O(n^3) operation, which can sometimes be faster due to cache blocking on modern CPUs. Both are guesses which might be wrong; it's not necessarily what you assume. Profiling will show that the time is spent in the algebra. – sascha Feb 10 '21 at 00:29
  • @juanpa.arrivillaga the confusion is that performing this expensive operation appears to result in a much *faster* overall operation. – Karl Knechtel Feb 10 '21 at 01:16
  • `x=self.X[:,:,i].copy()` would be sufficient. `deepcopy` only matters when the array `dtype` is `object`. Not that this affects your timings. – hpaulj Feb 10 '21 at 02:11
  • I would think this has something to do with the fact that `self.X[:,:,i]` does not return a copy of those items in self.X; it returns a view. You are then doing matmul on two view objects. I don't know why that would be significantly slower than matmul in general. Maybe that's a bug? – Bobby Ocean Feb 10 '21 at 02:53
  • @BobbyOcean Yeah, it has something to do with making copies for sure (I suppose that's obvious already), but when I use numpy.linalg.multi_dot([self.X[:,:,i], self.Y[:,:,i], self.Z[:,:,i]]), which makes copies within the function, for matmul1, the speed is equivalent to matmul2. I have noticed this when passing (views of) instance variables to scipy optimize methods as well. Just wondering why the copies make so much of a difference. – pigtowndandy Feb 10 '21 at 03:26
  • When all the arrays have the same size, `multi_dot` doesn't do anything special. – hpaulj Feb 10 '21 at 05:37
  • `np.dot` gets similar speed to your `matmul2` even without the copies. It looks like `matmul` is taking some sub-optimal route when the arrays are views, whether produced by indexing or by transpose. – hpaulj Feb 10 '21 at 07:40
  • It is simply not possible to call gemm (which numpy calls in the backend) with strided memory. If you provide strided arrays, numpy uses another internal algorithm which is much slower. (You could also try a simple matrix multiplication with dtype=np.int64 instead of np.float64 to see the same effect.) – max9111 Feb 10 '21 at 14:39
  • @max9111 I know this isn't included in my original question, but does this explain why `A[:,:,i] = X[:,:,i].dot(Y[:,:,i]).dot(Z[:,:,i])` gets me the same performance as producing deep copies and using the @ operator? – pigtowndandy Feb 10 '21 at 16:01
  • @pigtowndandy It looks to be the case. The numpy source isn't so easy to read (at least for me), but the reason for this behavior can be found here: https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/matmul.c.src – max9111 Feb 10 '21 at 16:44

1 Answer


If you put the m dimension first, you could do this product without iteration:

In [146]: X1,Y1,Z1 = X.transpose(2,0,1), Y.transpose(2,0,1), Z.transpose(2,0,1)
In [147]: A1 = X1@Y1@Z1
In [148]: np.allclose(A, A1.transpose(1,2,0))
Out[148]: True

However, working with very large arrays in a single operation is sometimes slower, due to memory management complexities.

It might be worth testing

 A1[i] = X1[i] @ Y1[i] @ Z1[i]

where the iteration is on the outermost dimension.

My computer is too small to do good timings on these array sizes.

edit

I added these alternatives to your class, and tested with a smaller case:

In [67]: class Test:
    ...:     def __init__(self, n, m):
    ...:         self.X = np.random.rand(n,n,m)
    ...:         self.Y = np.random.rand(n,n,m)
    ...:         self.Z = np.random.rand(n,n,m)
    ...:     def matmul1(self):
    ...:         A = np.zeros(self.X.shape)
    ...:         for i in range(self.X.shape[2]):
    ...:             A[:,:,i] = self.X[:,:,i] @ self.Y[:,:,i] @ self.Z[:,:,i]
    ...:         return A
    ...:     def matmul2(self):
    ...:         A = np.zeros(self.X.shape)
    ...:         for i in range(self.X.shape[2]):
    ...:             x = self.X[:,:,i].copy()
    ...:             y = self.Y[:,:,i].copy()
    ...:             z = self.Z[:,:,i].copy()
    ...:             A[:,:,i] = x @ y @ z
    ...:         return A
    ...:     def matmul3(self):
    ...:         x = self.X.transpose(2,0,1).copy()
    ...:         y = self.Y.transpose(2,0,1).copy()
    ...:         z = self.Z.transpose(2,0,1).copy()
    ...:         return (x@y@z).transpose(1,2,0)
    ...:     def matmul4(self):
    ...:         x = self.X.transpose(2,0,1).copy()
    ...:         y = self.Y.transpose(2,0,1).copy()
    ...:         z = self.Z.transpose(2,0,1).copy()
    ...:         A = np.zeros(x.shape)
    ...:         for i in range(x.shape[0]):
    ...:             A[i] = x[i]@y[i]@z[i]
    ...:         return A.transpose(1,2,0)

In [68]: t1=Test(100,50)
In [69]: np.max(np.abs(t1.matmul2()-t1.matmul4()))
Out[69]: 0.0
In [70]: np.allclose(t1.matmul3(),t1.matmul2())
Out[70]: True

The view iteration is 10x slower:

In [71]: timeit t1.matmul1()
252 ms ± 424 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [72]: timeit t1.matmul2()
26 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The added alternatives come in at about the same time:

In [73]: timeit t1.matmul3()
30.8 ms ± 4.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [74]: timeit t1.matmul4()
27.3 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Without the copy(), the transpose produces a view, and times are similar to matmul1 (250ms).

My guess is that with "fresh" copies, matmul is able to pass them to the best BLAS function by reference. With views, as in matmul1, it has to take some sort of slower route.
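
One way to check that guess (an editorial sketch, not part of the original answer) is to inspect the memory layout that matmul actually sees in each case:

import numpy as np

X = np.random.rand(100, 100, 50)

print(X[:, :, 0].flags['C_CONTIGUOUS'])                      # False: an indexed slice is a strided view
print(X.transpose(2, 0, 1)[0].flags['C_CONTIGUOUS'])         # False: a transpose is also a view
print(X.transpose(2, 0, 1).copy()[0].flags['C_CONTIGUOUS'])  # True: copy() packs the data contiguously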

But if I use dot instead of matmul, I get the faster time, even with the matmul1 iteration:

In [77]: %%timeit
    ...: A = np.zeros(X.shape)
    ...: for i in range(X.shape[2]):
    ...:     A[:,:,i] = X[:,:,i].dot(Y[:,:,i]).dot(Z[:,:,i])
25.2 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It sure looks like matmul with views is taking some suboptimal calculation choice.
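
To isolate the effect yourself (an editorial sketch; timings will vary with machine and BLAS build), you can time a single matmul on the same data as a strided view versus a contiguous copy:

import numpy as np
from timeit import timeit

X = np.random.rand(300, 300, 100)
Y = np.random.rand(300, 300, 100)

view_x, view_y = X[:, :, 0], Y[:, :, 0]        # strided views into the 3-way arrays
copy_x, copy_y = view_x.copy(), view_y.copy()  # contiguous copies of the same data

t_view = timeit(lambda: view_x @ view_y, number=100)
t_copy = timeit(lambda: copy_x @ copy_y, number=100)
print(f"views: {t_view:.3f}s   copies: {t_copy:.3f}s")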

hpaulj
  • As you show in your last code block, using views with numpy.dot() vs. the @ operator reproduces the performance disparity we've been talking about. I think you have basically answered the question. If I want to dig deeper into this (e.g., if I need to avoid functions that make copies due to array size, if that's even what is going on with numpy.dot()), I will ask a more specific question about that. Thank you! – pigtowndandy Feb 10 '21 at 15:55