3

I want to normalize a vector. The easiest way to do it is

import numpy as np
v = np.random.rand(3)
v /= np.linalg.norm(v)

But I'm worried about the performance of my package: summing the squares (which is inevitable), taking the square root, and then dividing the whole vector does not seem like a good idea.

Then I found this question, whose solution uses sklearn.preprocessing.normalize. Unfortunately, that adds another requirement/dependency to my package.
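
For reference, that sklearn-based solution would look something like this (a minimal sketch; note that normalize expects a 2-D array, hence the reshape):

import numpy as np
from sklearn.preprocessing import normalize

v = np.random.rand(3)
v = normalize(v.reshape(1, -1)).ravel()  # L2 norm by default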

Here are my questions:

  1. Shouldn't there be a numpy function to do that, one which uses the fast inverse square root algorithm? Or is it outside numpy's scope, so there shouldn't be a function like that?
  2. Should I implement my own function in Cython/Numba?
  3. Or, if I'm so worried about performance, should I give up Python and start writing code in C/C++?
Carlos Adir
  • You should stop worrying about performance until you have *measured* that you have a performance problem and this is causing it. – mkrieger1 Jun 18 '22 at 17:16
  • If `sklearn.preprocessing.normalize` does what you need, then can't you look at its source, figure out what it is doing, and then write your own function to do the same thing? Your question 1 - are you asking someone to search through the numpy documentation for you to find a function you want? – wwii Jun 18 '22 at 17:17
  • I think that [this file](https://github.com/ajcr/ajcr.github.io/blob/master/_posts/2016-04-01-fast-inverse-square-root-python.md) is worth reading for you. – The_spider Jun 18 '22 at 17:36
  • **don't even think** about the "fast inverse square root" algorithm. that's a numerical **approximation**. `v / np.linalg.norm(v)`, ignoring some fixed overhead, does exactly what it needs to, and nothing more, and that is: it makes a *new* vector that is scaled to have length 1. – Christoph Rackwitz Jun 19 '22 at 14:16
  • @wwii, I searched the ```numpy``` docs before writing the question and did not find such a function, so the question was more "Why is there no such function?". As for looking at ```sklearn```'s source function, that doesn't help much because it's implemented at a lower level and the algorithm (as linked) is already well known. – Carlos Adir Jun 19 '22 at 20:32
  • @ChristophRackwitz I did not get it. Every square root is a numerical approximation. Calculating it and inverting it still remains an approximation, just like using the "fast inverse square root" algorithm. If we want more precision, we just apply Newton's iteration again and it converges very fast from the value we already have. As for ```v/=np.linalg.norm(v)```, it doesn't create a new vector; it divides the vector ```v``` in place by a ```numpy.float64```, as we can check from the memory address using ```id(v)``` before and after the operation. Creating a new vector is not a good idea for efficiency. – Carlos Adir Jun 19 '22 at 20:44
  • a proper sqrt is accurate to (close to) as many bits as your data type has. the "fast 1/sqrt" is doing **one iteration** of a numerical algorithm. it'll be quite far off. I don't know what kind of precision you need but you'll get perhaps 3 digits from this... and it won't be faster than using CPU instructions, because CPUs are made to calculate this quickly and accurately. this is premature optimization, at the expense of accuracy. – Christoph Rackwitz Jun 19 '22 at 20:50

3 Answers

2

Shouldn't there be a numpy function to do that, one which uses the fast inverse square root algorithm? Or is it outside numpy's scope, so there shouldn't be a function like that?

I am not aware of any function in Numpy doing exactly that. Multiple function calls are needed in pure Numpy. sklearn.preprocessing.normalize is indeed a good alternative (and AFAIK not the only package that provides this).

The thing is that Numpy is not designed to compute small arrays efficiently. The overhead of a Numpy call is huge for small arrays (like one with only 3 values), and combining multiple function calls only makes it worse. The overhead basically comes from type/shape/value checking, internal function calls, the CPython interpreter, as well as the allocation of new arrays. Thus, even if Numpy provided exactly the function you want, it would still be slow for an array with only 3 items.
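
To illustrate the point, here is a rough timing sketch (the exact numbers are machine-dependent; it is only meant to show that the fixed per-call overhead dominates for tiny arrays):

import timeit
import numpy as np

small = np.random.rand(3)
big = np.random.rand(1_000_000)

# For a 3-element vector, nearly all of the time is fixed per-call overhead
print(timeit.timeit(lambda: small / np.linalg.norm(small), number=100_000))
# On a large array the same overhead is amortized over many elements
print(timeit.timeit(lambda: big / np.linalg.norm(big), number=100))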

Should I implement my own function in Cython/Numba?

This is a good idea since Numba can do that with a much smaller overhead. Note that calling a Numba function still has a small overhead, but calling it from a Numba context is very cheap (a native call).

For example, you can use:

import numba as nb
import numpy as np

# Notes:
# - The explicit signature causes an eager compilation
# - ::1 means the axis is contiguous (generates faster code)
@nb.njit('(float64[::1],)')
def normalize(v):
    # Sum of squares
    s = 0.0
    for i in range(v.size):
        s += v[i] * v[i]
    # Scale the vector in-place by the inverse of its norm
    inv_norm = 1.0 / np.sqrt(s)
    for i in range(v.size):
        v[i] *= inv_norm

This function does not allocate any new array since it works in-place. Moreover, Numba only needs to perform a minimal amount of checks in the wrapping function. The loops are very fast, but they can be made even faster if you replace v.size with the actual size (e.g. 3), because the JIT can then unroll the loops and generate nearly optimal code. np.sqrt will be inlined and should generate a fast square-root FP instruction. If you use the flag fastmath=True, the JIT might even compute the reciprocal square root using a dedicated, faster instruction on x86-64 platforms (note that fastmath is unsafe if you use special values like NaN or care about FP associativity).

Still, the overhead of calling this function is likely 100-300 ns on mainstream machines for very small vectors: the CPython wrapping function has a significant overhead. The only way to remove it is to call the function from Numba/Cython code. If you need to do that in most of your project, then writing the code directly in C/C++ is certainly better.
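
For illustration, here is a sketch of the specialized variant described above (the fixed size of 3 and the use of fastmath=True are assumptions about your use case, not requirements):

import numba as nb
import numpy as np

# Specialized for 3-element vectors so the JIT can fully unroll the work.
# fastmath=True permits the reciprocal-square-root optimization but is unsafe
# with NaN/Inf or if you care about strict FP semantics.
@nb.njit('(float64[::1],)', fastmath=True)
def normalize3(v):
    s = v[0] * v[0] + v[1] * v[1] + v[2] * v[2]
    inv_norm = 1.0 / np.sqrt(s)
    v[0] *= inv_norm
    v[1] *= inv_norm
    v[2] *= inv_norm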

Or, if I'm so worried about performance, should I give up Python and start writing code in C/C++?

It depends on your overall project, but if you want to manipulate many small vectors like this one, it is much more efficient to use C/C++ directly. The alternative is to use Numba or Cython for the kernels that are currently slow.

The performance of well-optimized Numba or Cython code can be very close to that of natively compiled C/C++ code. For example, I once succeeded in outperforming the heavily-optimized OpenBLAS code with Numba (thanks to specialization). One of the main sources of overhead in Numba is array bound checking (which can often be optimized out depending on the loops). C/C++ is lower-level, so you do not pay any hidden cost, but the code can be harder to maintain. Additionally, you can apply lower-level optimizations that are not even possible in Numba/Cython (e.g. using SIMD intrinsics or assembly instructions directly, or generating specialized code with templates).
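
As a sketch of the "call it from a Numba/Cython context" idea mentioned above (the batch kernel below is purely illustrative and assumes the normalize function defined earlier is in scope):

import numba as nb
import numpy as np

# Calling one Numba function from another is a cheap native call, so the
# CPython wrapper overhead is paid only once for the whole batch.
@nb.njit('(float64[:, ::1],)')
def normalize_all(vectors):
    for i in range(vectors.shape[0]):
        normalize(vectors[i])  # normalize() is the Numba kernel defined above

vectors = np.random.rand(1000, 3)
normalize_all(vectors)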

Jérôme Richard
1

The best solution that you can achieve has a time complexity of O(2n) = O(n), which is not very efficient but is at least linear. The way to proceed is the following:

1- First you calculate the magnitude of the vector and its numerical inverse, which inevitably leads to a linear time complexity regardless of the performance of the programming language. So, up to this point, we have an O(n) algorithm.

2- Then all the components of the vector are multiplied by the calculated inverse magnitude of the original vector, resulting in its normalization. As we are traversing the entire list again, we add another O(n) term to the algorithm's complexity. Finally, we add those time complexity terms and get O(n)+O(n)=O(2n).

import numpy as np
v = np.random.rand(3)
# mod holds the *inverse* magnitude: 1 / sqrt(sum of squares)
mod = (sum([i**2 for i in v]))**-0.5
print(v, mod)
# Scale every component by the inverse magnitude
v *= mod
print(v)

However, this does not seem like a problem that requires a high degree of optimization and efficiency. Nevertheless, I will take a closer look at it in order to find a better way of approaching this problem. Maybe other programming resources such as recursion or advanced data structures could slightly decrease its running time.

Cardstdani
  • Why do you use the list comprehension here? You can just use `np.sum(v**2)`, this is the reason why people are even using them. – The_spider Jun 18 '22 at 19:52
  • not every problem benefits from time complexity analysis or suggesting complex data structures that clearly have no benefits here. there's more to computer science than the first two semesters. – Christoph Rackwitz Jun 19 '22 at 14:20
  • @The_spider ```v *= np.sum(v**2)**-0.5``` takes almost ```1.5``` times more time than the list comprehension ```v *= (sum([i**2 for i in v]))**-0.5``` he used. – Carlos Adir Jun 19 '22 at 21:18
-1

As written by @mkrieger1, write your code first and test whether the performance is acceptable for your workload. If it is not, start trying out the solutions you mentioned and compare the differences on your device. Generally, the hardware the code runs on can have an impact on the performance differences, meaning that not every library is superior to the others across all devices.
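
For instance, a small benchmarking sketch along those lines (assuming the plain Numpy and sklearn variants are the candidates; adapt it to whatever solutions you actually compare):

import timeit
import numpy as np
from sklearn.preprocessing import normalize as sk_normalize

v = np.random.rand(3)

# Measure each candidate on your own machine before optimizing further
print(timeit.timeit(lambda: v / np.linalg.norm(v), number=100_000))
print(timeit.timeit(lambda: sk_normalize(v.reshape(1, -1)), number=100_000))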

If you really want to go all in, I suggest executing your code in parallel, possibly even on the GPU. You can use pytorch for this, or write code in C and CUDA (or similar) to get the most out of the system. It is important, however, that summing up the elements is also done in parallel.
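
For example, a minimal PyTorch sketch (assuming PyTorch is installed; it falls back to the CPU if no CUDA device is available):

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Normalize a whole batch of vectors in one call; the sums of squares are
# computed internally as parallel reductions.
vectors = torch.rand(1_000_000, 3, device=device)
unit_vectors = F.normalize(vectors, p=2.0, dim=1)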

Finally, if you stick to Python, take a look at Python 3.11, as this release improved Python's performance in some areas.

Harald