To normalize the rows of a matrix X
to unit length, I usually use:
X /= np.linalg.norm(X, axis=1, keepdims=True)
While trying to optimize this operation for an algorithm, I was quite surprised to find that writing out the normalization explicitly is about 40% faster on my machine:
X /= np.sqrt(X[:,0]**2+X[:,1]**2+X[:,2]**2)[:,np.newaxis]
X /= np.sqrt(sum(X[:,i]**2 for i in range(X.shape[1])))[:,np.newaxis]
How come? Where is the performance lost in np.linalg.norm()?
import numpy as np
X = np.random.randn(10000,3)
%timeit X/np.linalg.norm(X,axis=1, keepdims=True)
# 276 µs ± 4.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit X/np.sqrt(X[:,0]**2+X[:,1]**2+X[:,2]**2)[:,np.newaxis]
# 169 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit X/np.sqrt(sum(X[:,i]**2 for i in range(X.shape[1])))[:,np.newaxis]
# 185 µs ± 4.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I observe this for (1) Python 3.6 + NumPy v1.17.2
and (2) Python 3.9 + NumPy v1.19.3
on a 2015 MacBook Pro with OpenBLAS support.
I don't think this is a duplicate of this post, which addresses matrix norms; my question is about the L2 norm of row vectors.
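For completeness, here is a sketch of a further variant I have seen suggested for this kind of row-wise reduction (not part of my original timings): using np.einsum to compute the per-row squared sums in a single pass, which avoids the intermediate arrays that both np.linalg.norm and the explicit X[:,0]**2 + ... expression allocate.

```python
import numpy as np

X = np.random.randn(10000, 3)

# 'ij,ij->i' multiplies X elementwise with itself and sums over j,
# yielding the squared L2 norm of each row in one fused reduction.
norms = np.sqrt(np.einsum('ij,ij->i', X, X))[:, np.newaxis]
X_normalized = X / norms

# Sanity check: agrees with np.linalg.norm up to floating-point error.
assert np.allclose(
    X_normalized,
    X / np.linalg.norm(X, axis=1, keepdims=True),
)
```

Whether this beats the hand-written sum on a given machine will depend on the NumPy version and array shape, so it would need to be timed alongside the variants above.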