What happens in numpy's log function? Are there ways to improve the performance?

Question

I have a computation project that makes heavy use of the log function (on integers), with billions of calls. I find numpy's log surprisingly slow.

The following code takes 15 to 17 secs to complete:

import numpy as np
import time

t1 = time.time()
for i in range(1,10000000): 
    np.log(i)
t2 = time.time()
print(t2 - t1)

However, the math.log function takes much less time, 3 to 4 seconds:

import math
import time

t1 = time.time()
for i in range(1,10000000): 
    math.log(i)
t2 = time.time()
print(t2 - t1)

I also tested MATLAB and C#, which take about 2 seconds and just 0.3 seconds, respectively.

matlab

tic
for i = 1:10000000
    log(i);
end
toc

C#

var t = DateTime.Now;
for (int i = 1; i < 10000000; ++i)
     Math.Log(i);
Console.WriteLine((DateTime.Now - t).TotalSeconds);

Is there any way in Python to improve the performance of the log function?

I am pretty interested in the answer, especially because I just looked at the source code and the function is hard to trace. Next to it in the source code is written: `# real signature unknown; restored from __doc__`. Can anybody also explain how the source code works? — Eric B, May 10 '17 at 12:39
The answer seems to be in here http://stackoverflow.com/questions/3650194/are-numpys-math-functions-faster-than-pythons?rq=1 — Bathsheba, May 10 '17 at 12:47
`np.log` is optimised to work on arrays of values, not single values. For example `np.log(np.arange(1,10000000))` (log of array of integers in that range) takes about 120ms for me. — Alex Riley, May 10 '17 at 12:47
Just asked a friend of mine and he pointed out the overhead cost of `numpy`. Basically, `numpy` tests for various things before probably executing `math.log`. — Eric B, May 10 '17 at 12:53
You are unfairly comparing those functions when you're not using `np.log()` on arrays. — Nils Werner, May 10 '17 at 16:47
Maybe you can reformulate your problem and explain why you need to call `np.log()` that often instead of using vectorisation. — Nils Werner, May 11 '17 at 07:47

score 5 · Accepted Answer · edited May 23 '17 at 12:34

NumPy's functions are designed for arrays, not for single values or scalars. They have a rather high overhead because they do several checks and conversions that pay off for big arrays but are costly for scalars.

The conversion is really obvious if you check the type of the return:

>>> import numpy as np
>>> import math

>>> type(np.log(2.))
numpy.float64
>>> type(math.log(2.))
float

On the other hand, the math module is optimized for scalars, so its functions don't need that many checks (I think there are only two: convert to float and check if it's <= 0). That is why math.log is faster than numpy.log for scalars.
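A rough way to see this per-call overhead, as a sketch rather than a rigorous benchmark: time both functions on the same scalar with the built-in timeit module. The call count of 100,000 is an arbitrary choice, and the absolute numbers will vary with machine and library versions.

```python
import math
import timeit

import numpy as np

N_CALLS = 100_000  # arbitrary repetition count chosen for illustration

# Time N_CALLS scalar calls of each function on the same input.
t_math = timeit.timeit(lambda: math.log(2.0), number=N_CALLS)
t_numpy = timeit.timeit(lambda: np.log(2.0), number=N_CALLS)

print(f"math.log: {t_math / N_CALLS * 1e9:.0f} ns per call")
print(f"np.log:   {t_numpy / N_CALLS * 1e9:.0f} ns per call")

# Both compute the same value; only the return type and the overhead differ.
same_value = math.isclose(math.log(2.0), float(np.log(2.0)))
```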

But if you operate on arrays and want to take the logarithm of all elements in the array NumPy can be much faster. On my computer if I time the execution of np.log on an array compared to math.log of each item in a list then the timing looks different:

arr = np.arange(1, 10000000)
%timeit np.log(arr)
201 ms ± 959 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

lst = arr.tolist()
%timeit [math.log(item) for item in lst]
8.77 s ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So np.log is much faster on arrays (more than 40 times faster in this case), and you don't need to write any loop yourself. As a ufunc, np.log also works correctly on multidimensional NumPy arrays and can perform the operation in place.
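A small sketch of the multidimensional and in-place behaviour. The array name and shape are made up for illustration, and the array is created as float64 because the logarithms cannot be written back into an integer array.

```python
import numpy as np

# A 2-D float array; the in-place variant needs a float dtype.
grid = np.arange(1, 7, dtype=np.float64).reshape(2, 3)

# Default call: returns a new array with the same shape as the input.
expected = np.log(grid)

# out= writes the result into an existing array instead of allocating a new one.
np.log(grid, out=grid)

print(grid.shape)
print(np.allclose(grid, expected))
```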

As a rule of thumb: if you have an array with thousands of items, NumPy will be faster; if you have scalars or only a few dozen items, math plus an explicit loop will be faster.

Also, don't use the time module for timing code. There are dedicated modules that give more accurate results, provide better statistics, and disable garbage collection during the timings:

timeit (built-in)
perf (extension package, since renamed pyperf)

I generally use %timeit, which is a convenient wrapper around the timeit functionality, but it requires IPython. These tools conveniently display the mean and standard deviation of the results and do some (mostly) useful statistics, like displaying the "best of 7" or "best of 3" result.
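Outside IPython, the built-in timeit module can be called directly. A minimal sketch, using a smaller array than above so it runs quickly; the repeat and number values are arbitrary choices:

```python
import math
import timeit

import numpy as np

REPEAT = 5    # number of independent runs (arbitrary)
NUMBER = 10   # executions of the statement per run (arbitrary)

arr = np.arange(1, 100_001)
lst = arr.tolist()

# repeat() returns one total time per run; each run executes the callable NUMBER times.
numpy_times = timeit.repeat(lambda: np.log(arr), repeat=REPEAT, number=NUMBER)
math_times = timeit.repeat(lambda: [math.log(x) for x in lst], repeat=REPEAT, number=NUMBER)

# The minimum over the runs is the measurement least disturbed by other processes.
print(f"np.log on array:    {min(numpy_times) / NUMBER * 1e3:.3f} ms per call")
print(f"math.log in a loop: {min(math_times) / NUMBER * 1e3:.3f} ms per call")
```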

I recently analyzed the runtime behaviour of numpy functions for another question, some of the points also apply here.

score 1 · Answer 2 · answered May 15 '17 at 11:49

Interestingly, the slowness you saw with the Python standard library's math.log doesn't replicate on my machine (Windows 10, running Python 2.7.11 and numpy 1.11.0).

>>> import math
>>> import time
>>> import numpy as np
>>> t1 = time.time()
>>> for i in range(1, 10000000):
...     _ = np.log(i)
...
>>> t2 = time.time()
>>> print(t2 - t1)
9.86099982262
>>> t1 = time.time()
>>> for i in range(1, 10000000):
...     _ = math.log(i)
...
>>> t2 = time.time()
>>> print(t2 - t1)
2.48300004005

That is similar to your MATLAB timing. @Nils raises a good point, though: numpy is designed to be efficient on arrays.

>>> t1 = time.time()
>>> for i in range(1, 1000):
...     _ = np.log(np.arange(1, 10000))
...
>>> t2 = time.time()
>>> print(t2 - t1)
0.146000146866
>>> t1 = time.time()
>>> for i in range(1, 1000):
...     _ = [math.log(j) for j in range(1, 10000)]
...
>>> t2 = time.time()
>>> print(t2 - t1)
2.3220000267

If you can vectorize your input, numpy will outperform the standard math library and even come close to C#.
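If the inputs are too numerous to hold in one array, one option is to vectorize in chunks. The following is only a sketch under that assumption: `log_in_chunks` and its default chunk size are illustrative names and values, not part of NumPy, and summing the logs is just a stand-in for whatever the real computation does with them.

```python
import numpy as np


def log_in_chunks(start, stop, chunk_size=1_000_000):
    """Yield np.log of the integers in [start, stop), one chunk at a time.

    Hypothetical helper: the name and the default chunk size are illustrative.
    """
    for lo in range(start, stop, chunk_size):
        hi = min(lo + chunk_size, stop)
        yield np.log(np.arange(lo, hi))


# Example: sum of log(i) for i in 1..9,999,999 with one NumPy call per chunk
# instead of one Python-level call per integer.
total = sum(chunk.sum() for chunk in log_in_chunks(1, 10_000_000))
print(total)
```

Memory use is bounded by the chunk size, while each chunk is still large enough for the per-call overhead of np.log to be negligible.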
