I need to optimize a script that makes heavy use of computing the L1 norm of vectors. The L1 norm here is just the sum of absolute values. When timing how fast numpy is at this task I found something weird: adding up all the vector elements is about 3 times faster than taking the absolute value of every element of the vector. This is a surprising result, since addition is fairly complex compared to taking the absolute value, which only requires zeroing every 32nd bit of a data block (the sign bit, assuming float32).
Why is addition 3x faster than a simple bitwise operation?
import numpy as np
a = np.random.rand(10000000)
%timeit np.sum(a)
13.9 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.abs(a)
41.2 ms ± 92.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
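For reference, here is a minimal sketch of the computation I have in mind, including the "zero the sign bit" idea written out explicitly. It is only an illustration, not a benchmark: the array size, the seed, and the variable names are arbitrary choices of mine. Note that `np.random.rand` returns float64, so the sketch casts to float32 to match the assumption above.

```python
import numpy as np

# Hypothetical test data: signed float32 values (size and seed are arbitrary).
rng = np.random.default_rng(0)
a = (rng.random(1_000_000) - 0.5).astype(np.float32)

# L1 norm as the sum of absolute values.
# np.abs allocates a full temporary array before np.sum reduces it.
l1_direct = np.sum(np.abs(a))

# The same quantity via NumPy's norm routine (ord=1 on a 1-D vector).
l1_linalg = np.linalg.norm(a, ord=1)

# Absolute value by clearing the sign bit: reinterpret the float32 buffer
# as uint32, mask off the top bit, and reinterpret the result as float32.
bits = a.view(np.uint32) & np.uint32(0x7FFFFFFF)
abs_via_bits = bits.view(np.float32)
l1_bits = np.sum(abs_via_bits)
```

All three results should agree up to floating-point rounding. One difference visible in the sketch is that `np.abs(a)` has to write out a whole new array, whereas `np.sum(a)` only reads the input and produces a single scalar.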