0

I would like to perform a million times 10 to the power of a real number (10**x, where x = float) as fast as possible. My fastest option so far was using numpy's exp function:

import numpy as np

n = 1000000
p = np.random.uniform(0.,1.,n)

def pow10(p):
    ln10 = np.log(10)
    return np.exp(p*ln10)

Which takes on my machine about 6 ms.

Using Cython, performing the power function a million times is about 10 000 times faster (500 ns):

%%cython
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cpow10(double[::1] p, int n):
    cdef int i
    cdef double res
    for i in range(n):
        res = 10**p[i] # just calculation
    return True

result = cpow10(p,n)

BUT only if I don't store the results back in an array:

def cpow10(double[::1] p,double[::1] res, int n):
    cdef int i
    for i in range(n):
        res[i] = 10**p[i] # calculation and storage
    return res

r = np.zeros((n),dtype=float)
result = cpow10(p,r,n)

Which yields again about 6 ms. Any idea why I loose so much speed and how to fix this ?

aposch72
  • 161
  • 2
  • 10
    This has all the signs of C realizing that the result is unused and just skipping the whole loop (i.e. you've not really improved anything at all). You can probably make some improvements, but Numpy is probably pretty good here and you may not beat it by much – DavidW Apr 06 '21 at 10:42
  • Why not just use `numpy.power`? It's probably better than `np.exp(p*ln10)` (since it skips an intermediate). – DavidW Apr 06 '21 at 10:44
  • Hello DavidW, numpy.power is 3 times slower than the trick with the natural log (about 17 ms instead of 6).The big questionmark is, why does the second code perform a million power function calculations in just 500 ns? When I return "res" instead of True it holds the same value as the last element in the result array of the third code! – aposch72 Apr 06 '21 at 13:45
  • It should be quite easy to outperform numpy using numexpr or numba by a bit (also think of the needed precision). The reason is Intel VML/SVML support out of the box and easy parallelization and avoiding any temporary arrays. https://stackoverflow.com/a/56922359/4045774 In Cython you would have to use VML explicitly. https://stackoverflow.com/a/56939240/4045774 Also think of avoiding output allocation on every call. This is also quite costly. – max9111 Apr 06 '21 at 14:27
  • 1
    @aposch72 The second code does not perform a million per calculations. It notices the result is unused and doesn't bother. Returning the last value of `res` only forces it to perform 1 calculation. Fair enough about `power` - I assumed it might be better, but didn't test it. – DavidW Apr 06 '21 at 14:42

0 Answers0