import numpy as np
from datetime import datetime
import math

def norm(l):
    # Euclidean norm in pure Python: sum of squares, then square root
    s = 0
    for i in l:
        s += i**2
    return math.sqrt(s)

def foo(a, b, f):
    # Time b calls of f on a range with a elements
    l = range(a)
    s = datetime.now()
    for i in range(b):
        f(l)
    e = datetime.now()
    return e - s

print(foo(10**4, 10**5, norm))
print(foo(10**4, 10**5, np.linalg.norm))
print(foo(10**2, 10**7, norm))
print(foo(10**2, 10**7, np.linalg.norm))

I got the following output:

0:00:43.156278
0:00:23.923239
0:00:44.184835
0:01:00.343875

It seems that when np.linalg.norm is called many times on small-sized data, it runs slower than my norm function.

What is the cause of that?


2 Answers


First of all: datetime.now() isn't appropriate for measuring performance. It measures wall-clock time, so you may simply pick a bad moment (for your computer) when a high-priority process runs or Python's GC kicks in.

There are dedicated timing functions/modules available in Python: the built-in timeit module, %timeit in IPython/Jupyter, and several external modules (like perf).
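For instance, a rough equivalent using the built-in timeit module could look like this (a minimal sketch; the timed statement and the repeat/loop counts are arbitrary choices of mine, not part of the original answer):

import timeit

# Time 1000 calls of np.linalg.norm on a small range, repeat the
# measurement 7 times and keep the best (least noisy) run.
best = min(timeit.repeat(
    "np.linalg.norm(r)",
    setup="import numpy as np; r = range(10**2)",
    repeat=7,
    number=1000,
))
print(f"{best / 1000 * 1e6:.1f} µs per call (best of 7 runs)")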

Let's see what happens if I use these on your data:

import numpy as np
import math

def norm(l):
    s = 0
    for i in l:
        s += i**2
    return math.sqrt(s)

r1 = range(10**4)
r2 = range(10**2)

%timeit norm(r1)
3.34 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.linalg.norm(r1)
1.05 ms ± 3.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit norm(r2)
30.8 µs ± 1.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.linalg.norm(r2)
14.2 µs ± 313 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

It isn't slower for short iterables; it's still faster. However, note that the real advantage of NumPy functions comes when you already have NumPy arrays:

a1 = np.arange(10**4)
a2 = np.arange(10**2)

%timeit np.linalg.norm(a1)
18.7 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.linalg.norm(a2)
4.03 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Yeah, it's quite a lot faster now: 18.7 µs vs. 1.05 ms, more than 50 times faster for 10000 elements. That means most of the time of np.linalg.norm in your examples was spent converting the range to an np.array.
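If you want to verify that yourself, timing the conversion alone is a quick check (a suggested experiment, not measurements from the answer; np.linalg.norm converts its input via np.asarray internally):

# The range -> array conversion should account for most of
# np.linalg.norm(r1)'s runtime on a plain range
%timeit np.array(r1)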

MSeifert
  • I just used %timeit on my example and got the following: %timeit f(100, 10**5, norm): 1 loop, best of 3: 442 ms per loop. %timeit f(100, 10**5, np.linalg.norm): 1 loop, best of 3: 607 ms per loop. I thought it may be because of the overhead in numpy and also converting the range to an np.array – Ladih Apr 16 '18 at 21:56
  • I think there's something missing in your comment - or I don't understand what you want to say. Sorry. – MSeifert Apr 16 '18 at 21:57
  • Yes, you are correct. If I use numpy.array, np.linalg.norm will be faster. Thanks – Ladih Apr 16 '18 at 22:00
  • @posmee Yeah, NumPy functions tend to have much higher overhead than pure Python functions. But most break even between 100 and 1000 elements. Before the break-even point it's not too bad, normally just a few µs slower, and if you're not in a tight loop, chasing a few microseconds is micro-optimizing (and generally not worth the trouble). – MSeifert Apr 16 '18 at 22:08

You are on the right track.

np.linalg.norm has quite high overhead on small arrays. On large arrays both the JIT-compiled function and np.linalg.norm run into a memory bottleneck, which is expected for a function that spends most of its time on simple multiplications.

If the jitted function is called from another jitted function, it might get inlined, which can give it a considerably larger advantage over the NumPy norm function (an illustrative caller is sketched after the example below).

Example

import numba as nb
import numpy as np

@nb.njit(fastmath=True)
def norm(l):
    # Accumulate the sum of squares in a float, then take the root;
    # fastmath lets LLVM reorder and vectorize the reduction.
    s = 0.
    for i in range(l.shape[0]):
        s += l[i]**2
    return np.sqrt(s)
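For illustration, here is a hypothetical caller building on the norm defined above (normalize is my example name, not from the answer); because norm is itself jitted, Numba can inline the call:

@nb.njit(fastmath=True)
def normalize(l):
    # norm is jitted too, so Numba may inline this call and avoid
    # any Python-level function-call overhead
    return l / norm(l)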

Performance

r1 = np.array(np.arange(10**2), dtype=np.int32)
Numba:  0.42 µs
linalg: 4.46 µs

r1 = np.array(np.arange(10**4), dtype=np.int32)
Numba:  8.9 µs
linalg: 13.4 µs

r1 = np.array(np.arange(10**2), dtype=np.float64)
Numba:  0.35 µs
linalg: 3.71 µs

r2 = np.array(np.arange(10**4), dtype=np.float64)
Numba:  1.4 µs
linalg: 5.6 µs

Measuring Performance

  • Call the jit-compiled function once before the measurement; there is a one-time compilation overhead on the first call (see the sketch below).
  • Check whether the measurement is valid: small arrays stay in the processor cache, so results can be too optimistic and exceed the RAM throughput you would see on realistic, larger examples.
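
A minimal sketch of such a measurement, assuming the jitted norm from above and an IPython session:

r = np.arange(10**4, dtype=np.float64)

norm(r)                    # warm-up call: the first call triggers compilation

%timeit norm(r)            # now only the compiled code is measured
%timeit np.linalg.norm(r)  # comparison under the same conditions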
max9111