8

I need to optimize a script that makes heavy use of computing the L1 norm of vectors. As we know, the L1 norm in this case is just the sum of absolute values. When timing how fast NumPy is at this task, I found something weird: adding up all the vector elements is about 3 times faster than taking the absolute value of every element. This is a surprising result, as addition is fairly complex compared to taking the absolute value, which only requires zeroing out the sign bit, i.e. every 32nd bit of the data block (assuming float32).
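
For concreteness, this is the kind of bitwise abs I have in mind (sketched for float64, which is what np.random.rand actually returns; a float32 array would mask a 32-bit view instead):

import numpy as np

a = np.random.rand(10000000)                          # float64

# "abs" as a pure bitmask: clear the sign bit (the top bit of each 64-bit float).
# This only illustrates the argument; it is not meant to replace np.abs.
mask = np.uint64(0x7FFFFFFFFFFFFFFF)
abs_via_mask = (a.view(np.uint64) & mask).view(np.float64)

assert np.array_equal(abs_via_mask, np.abs(a))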

Why is it that addition is 3x faster than a simple bitwise operation?

import numpy as np

a = np.random.rand(10000000)

%timeit np.sum(a)
13.9 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.abs(a)
41.2 ms ± 92.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Lugi
  • Very likely due to branch prediction - https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array – Maxim Oct 01 '17 at 13:25
  • It doesn't show any speedup when I sort the input array. – Lugi Oct 01 '17 at 13:43

2 Answers

4

There are several things to consider here. sum returns a scalar, while abs returns an array. So even if adding two numbers and taking the absolute value were equally fast, abs would be slower because it needs to allocate the output array. It also has to touch twice as many elements (reading from the input plus writing to the output).

So you cannot infer anything from these timings about the speed of addition vs. a bitwise operation.

You could, however, check whether it's faster to add something to each value of an array vs. taking the absolute of each value:

%timeit a + 0.1
9 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit abs(a)
9.98 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Or compare sum + memory allocation vs. taking the absolute:

%timeit np.full_like(a, 1); np.sum(a)
13.4 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit abs(a) 
9.64 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Just in case you want to make computing the norm faster, you could try numba (or Cython, or write a C or Fortran routine yourself); that way you avoid any memory allocation for intermediate arrays:

import numba as nb

@nb.njit
def sum_of_abs(arr):
    sum_ = 0.  # scalar accumulator, so no temporary abs-array is allocated
    for item in arr:
        sum_ += abs(item)
    return sum_

sum_of_abs(a)  # call once so the JIT compilation happens before timing
%timeit sum_of_abs(a)
# 2.44 ms ± 315 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
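
As a quick sanity check (purely illustrative), the result matches the plain NumPy computation:

np.isclose(sum_of_abs(a), np.abs(a).sum())
# True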
MSeifert
  • (I actually made the input array 10 times bigger, so the timings here are 10x longer in comparison to the OP). For sum and memory allocation: 142 ms ± 1.97 ms For absolute value: 428 ms ± 20.4 ms Plain memory allocation: %timeit np.empty_like(a) 6.44 µs ± 22.8 ns, so memory allocation seems to take a negligible amount of time. – Lugi Oct 01 '17 at 13:53
  • @Lugi I updated the answer. `np.empty_like` can actually "fake" allocating an array (depending on the OS). I now used `np.full_like` because that also emulates the doubled memory throughput. However, the comparisons are still not "fair" in a sense that would allow conclusions about the speed of "abs" vs. "addition". – MSeifert Oct 01 '17 at 13:55
3

np.sum returns a scalar. np.abs returns a new array of the same size. Allocating memory for that new array is what takes the most time here. Compare

>>> timeit("np.abs(a)", "import numpy as np; a = np.random.rand(10000000)", number=100)
3.565487278989167
>>> timeit("np.abs(a, out=a)", "import numpy as np; a = np.random.rand(10000000)", number=100)
0.9392949139873963

The argument out=a tells NumPy to put the result in the same array a, overwriting old data there. Hence the speedup.

Sum is still slightly faster:

>>> timeit("np.sum(a)", "import numpy as np; a = np.random.rand(10000000)", number=100)
0.6874654769926565

but it does not require as many memory writes.

If you don't want to overwrite a, providing another array for the output of abs is a possibility, assuming you have to repeatedly take abs of arrays of the same type and size.

b = np.empty_like(a)   # done once, outside the loop
np.abs(a, out=b)
np.sum(b)

runs in about half the time of np.linalg.norm(a, 1).

For reference, np.linalg.norm computes the L1 norm as

add.reduce(abs(x), axis=axis, keepdims=keepdims)

which involves allocating memory for the new array abs(x).
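
Putting that together, here is a sketch of the buffer-reusing version (the helper name l1_norm is illustrative, not part of NumPy):

import numpy as np

a = np.random.rand(10000000)
b = np.empty_like(a)                 # allocated once, reused on every call

def l1_norm(x, out):
    # out must have the same shape and dtype as x; avoids allocating abs(x) on each call
    np.abs(x, out=out)
    return np.sum(out)

np.isclose(l1_norm(a, b), np.linalg.norm(a, 1))
# True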


Ideally, there would be a way to compute the sum (or max or min) of all absolute values (or the results of another "ufunc") without writing all the output to RAM and then retrieving it for sum/max/min. There was some discussion in the NumPy repo, most recently in the issue "add a max_abs ufunc", but it hasn't reached implementation.

The ufunc.reduce method is available for functions with two inputs like add or logaddexp, but there is no addabs function (x, y : x+abs(y)) to reduce with.
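
To make that concrete, here is a rough sketch of what is and isn't available today (np.addabs is hypothetical and does not exist in NumPy):

import numpy as np

a = np.random.rand(10000000)

# Binary ufuncs expose .reduce(); this is essentially what np.linalg.norm does,
# and it still has to materialize the temporary array np.abs(a):
total = np.add.reduce(np.abs(a))

# A hypothetical "addabs" ufunc could reduce without the temporary array,
# but NumPy does not provide one, so np.addabs.reduce(a) is not possible.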

  • I believe the approach you provided that's 2x as fast as the np.linalg L1 norm still does a lot of unnecessary operations. np.abs(a, out=b) writes the result to RAM, then np.sum(b) reads that result back from RAM; is my thinking here correct? – Lugi Oct 01 '17 at 14:01
  • Correct. Ideally, we'd have a ufunc `addabs` for "add absolute value of", and run [ufunc.reduce(a)](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ufunc.reduce.html), or simply have `abssum` array method that does not create an intermediate array. This was briefly [discussed in NumPy issue tracker](https://github.com/numpy/numpy/issues/5218) a few years ago. –  Oct 01 '17 at 14:24