
I've been working with image transformations recently and ran into a situation where I have a large array (shape 100,000 x 3) in which each row represents a point in 3D space:

pnt = [x y z]

All I'm trying to do is iterate through each point and matrix-multiply it with a matrix T (shape 3 x 3).

Test with NumPy:

import numpy as np

def transform(pnt_cloud, T):
    # arr holds the (positive) first component of each transformed point
    arr = np.zeros(pnt_cloud.shape[0])

    for i, pnt in enumerate(pnt_cloud):
        xyz_pnt = np.dot(T, pnt)

        if xyz_pnt[0] > 0:
            arr[i] = xyz_pnt[0]

    return arr

Calling this function and measuring the runtime (using %time) gives:

Out[190]: CPU times: user 670 ms, sys: 7.91 ms, total: 678 ms
Wall time: 674 ms

Test with PyTorch tensor:

import torch

tensor_cld = torch.tensor(pnt_cloud)
tensor_T   = torch.tensor(T)

def transform(pnt_cloud, T):
    depth_array = torch.zeros(pnt_cloud.shape[0], dtype=pnt_cloud.dtype)

    for i, pnt in enumerate(pnt_cloud):
        xyz_pnt = torch.matmul(T, pnt)

        if xyz_pnt[0] > 0:
            depth_array[i] = xyz_pnt[0]

    return depth_array

Calling this function and measuring the runtime (using %time) gives:

Out[199]: CPU times: user 6.15 s, sys: 28.1 ms, total: 6.18 s
Wall time: 6.09 s

NOTE: Doing the same with torch.jit only shaves off about 2 s.
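
A scripted version of the loop might look like this (a sketch; the exact torch.jit code isn't shown in the question):

```python
import torch

# Sketch of a scripted version of the loop above (assumption: the torch.jit
# attempt mentioned in the question looked roughly like this)
@torch.jit.script
def transform_jit(pnt_cloud: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    # one output slot per point, same dtype as the input
    depth_array = torch.zeros_like(pnt_cloud[:, 0])
    for i in range(pnt_cloud.shape[0]):
        xyz_pnt = torch.matmul(T, pnt_cloud[i])
        if bool(xyz_pnt[0] > 0):
            depth_array[i] = xyz_pnt[0]
    return depth_array
```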

I would have thought that PyTorch tensor computations would be much faster due to the way PyTorch breaks its code down in the compiling stage. What am I missing here?

Would there be any faster way to do this other than using Numba?

lakshjaisinghani
    Quibble: how are you multiplying a point of shape (3,) with matrix of shape (4,4)? Aren't the dimensions incompatible? – NNN Sep 30 '20 at 11:38
  • is it possible pytorch is accumulating gradients for this operation? – Shai Sep 30 '20 at 11:40
  • Ohh yes, there is a fourth term I removed that to make the question more understandable and general, and forgot to reduce the size of the other matrix. I've edited the question, thanks for pointing that out Natchiket – lakshjaisinghani Sep 30 '20 at 12:55

2 Answers


For the speed, I got this reply from the PyTorch forums:

  1. Operations on 1-3 elements are generally rather expensive in PyTorch, as the overhead of Tensor creation becomes significant (this includes setting single elements); I think this is the main thing here. This is also the reason why the JIT doesn't help a whole lot (it only takes away the Python overhead) and Numba shines (where e.g. the assignment to depth_array[i] is just a memory write).

  2. the matmul itself might differ in speed if you have different BLAS backends for it in PyTorch vs. NumPy.
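
The effect in point 1 is easy to reproduce (a sketch with made-up sizes; the question used 100,000 x 3):

```python
import time

import torch

# Sketch with made-up sizes to illustrate per-op overhead
pnt_cloud = torch.randn(10_000, 3, dtype=torch.float64)
T = torch.randn(3, 3, dtype=torch.float64)

# Per-point loop: one tiny matmul per point, so Python/Tensor overhead dominates
t0 = time.perf_counter()
loop_result = torch.stack([torch.matmul(T, pnt) for pnt in pnt_cloud])
loop_time = time.perf_counter() - t0

# One batched matmul: row i of pnt_cloud @ T.T equals T @ pnt_cloud[i]
t0 = time.perf_counter()
batch_result = torch.matmul(pnt_cloud, T.T)
batch_time = time.perf_counter() - t0

print(torch.allclose(loop_result, batch_result))  # True: same numbers
print(batch_time < loop_time)                     # True: far less overhead
```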

lakshjaisinghani

Why are you using a for loop?
Why do you compute the full 3-vector result and then only use its first element?

You can do all the math in a single matmul:

with torch.no_grad():
  # (n, 3) @ (3, 1) -> (n, 1); only the first row of T matters here
  depth_array = torch.matmul(pnt_cloud, T[:1, :].T)
  # since you only want non-negative results
  depth_array = torch.clamp(depth_array, min=0)

Since you want to compare runtime to numpy, you should disable gradient tracking; hence the torch.no_grad() block.
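
For completeness, the same one-shot computation on the NumPy side (a sketch with random data standing in for the real point cloud):

```python
import numpy as np

# Random data stands in for the real point cloud (assumption: any
# 100,000 x 3 float array and 3 x 3 T behave the same way)
rng = np.random.default_rng(0)
pnt_cloud = rng.standard_normal((100_000, 3))
T = rng.standard_normal((3, 3))

# The first element of T @ pnt is T[0] @ pnt, so one dot with the
# first row of T replaces the whole loop
depth_array = np.maximum(pnt_cloud @ T[0], 0)

# Reference: the original per-point loop from the question
ref = np.zeros(pnt_cloud.shape[0])
for i, pnt in enumerate(pnt_cloud):
    x = (T @ pnt)[0]
    if x > 0:
        ref[i] = x

print(np.allclose(depth_array, ref))  # True
```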

Shai
  • In the question above, I state that pnt_cld is a 100,000 x 3 matrix, i.e. each row is 1 x 3. I'm matrix multiplying each 1 x 3 vector with T (3x3). [nx3 dot 3x1 -> nx3 does not equal the above procedure] – lakshjaisinghani Sep 30 '20 at 23:56
  • Another thing I found was that tensor operations by themselves don't accumulate gradients. They only accumulate gradients when the operation involves an nn.Module object. – lakshjaisinghani Oct 01 '20 at 00:00
  • @dankpenny I updated the comment in the code regarding matrix sizes. Since you only use one element of the result - you can multiply by a 3x1 row of the matrix, rather than the entire matrix. – Shai Oct 01 '20 at 05:38
  • Ohh yess, my bad haha! – lakshjaisinghani Oct 01 '20 at 22:01