
Whilst playing around with this question, I noticed something I couldn't explain regarding the relative performance of np.log2, np.log and np.log10:

In [1]: %%timeit x = np.random.rand(100000)
   ....: np.log2(x)
   ....: 
1000 loops, best of 3: 1.31 ms per loop

In [2]: %%timeit x = np.random.rand(100000)
   ....: np.log(x)
   ....: 
100 loops, best of 3: 3.64 ms per loop

In [3]: %%timeit x = np.random.rand(100000)
   ....: np.log10(x)
   ....: 
100 loops, best of 3: 3.93 ms per loop

np.log2 is about 3x faster than np.log and np.log10. Perhaps even more counter-intuitively, np.log1p(x), which computes ln(x + 1), is on par with np.log2:

In [4]: %%timeit x = np.random.rand(100000)
   ....: np.log1p(x)
   ....: 
1000 loops, best of 3: 1.46 ms per loop
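
(For context, np.log1p exists for accuracy rather than speed - a quick check, nothing specific to my setup:)

import numpy as np

# For tiny x, 1 + x rounds to exactly 1.0 in double precision, so
# log(1 + x) returns 0.0; log1p(x) computes ln(1 + x) without that loss.
tiny = 1e-18
print(np.log(1 + tiny))   # 0.0
print(np.log1p(tiny))     # ~1e-18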

I obtained almost identical timings in numpy v1.10.1 and v1.8.2.

Is there an intuitive explanation for these discrepancies in runtime performance?

ali_m
  • [This answer](http://math.stackexchange.com/a/61236/277288) on Math.SE seems to say that some methods reduce to `log2` for calculating any log. This may mean that the implementation of numpy's log functions depends, in one way or another, on log2 and/or ln(x+1). I think this has to do with the Taylor series of both of them as well – R Nar Nov 19 '15 at 17:05
  • This is a very interesting observation. I am by no means an expert in low-level implementations of efficient computing routines. Intuitively I would guess that this has to do with the fact that all logarithms are conceptually related: if you know one, you basically know them all by simple transformations. So at some point you have to decide which one can be calculated efficiently on a processor. Calculating the others via a transformation would then obviously take a little more time. But I would love to see an expert answer here. – cel Nov 19 '15 at 17:05
  • Perhaps, since binary data is base 2, there are some optimisation tricks available with log2 – wim Nov 19 '15 at 17:09
  • @wim Perhaps, but then what about `np.log1p`? – ali_m Nov 19 '15 at 17:13
  • That probably has to do with the relative simplicity of the [Taylor series of `log(x+1)`](http://www.wolframalpha.com/input/?i=taylor+series+for+log%281x%2B1%29) – R Nar Nov 19 '15 at 17:27
  • I just tried to test with C code and glibc, using log2/log10/log, and didn't really see any difference. But yes, numpy *should* use the glibc versions of those functions; there is apparently a fallback to a numpy-internal function if for some reason log2 is not in the specific glibc version. – reptilicus Nov 19 '15 at 18:48
  • @FermionPortal Would you be interested in writing your comments up as an answer? I could have a go myself, but it seems a shame to let the bounty go to waste ;-) – ali_m Nov 27 '15 at 13:14
  • Intuitively, since floating point numbers are already in base 2, you would expect log2() to do creative "bit banging" to calculate the base-2 logarithm, and that's pretty much what I've found in various algorithms. For example, it's not hard to find the first set bit in the exponent, which gives you the mantissa for log2. For log base 10, you have to go through either Maclaurin or Taylor series (the naive way) or other series/table-lookup methods. – scooter me fecit Jun 13 '18 at 03:32

4 Answers


I found your question extremely interesting, so I spent a few hours doing a bit of research. I think I have found an explanation for the performance difference as it applies to integers (thank you Matteo Italia for your note); it is unclear how that reasoning can be extended to floats, though:

Computers use base 2. According to the articles linked below, computing the integer log2 takes about 4 processor cycles, while log10 requires multiplying log2(val) by 1/log2(10), which adds another 5 cycles.

Finding log2 of an integer is a matter of finding the index of its most significant set bit (video at about the 23rd minute mark onward); a Python sketch follows the quoted passage below.

bit hacks: Find integer log base 10 of an integer

bit hacks: Find the log base 2 of an N-bit integer in O(lg(N))

The integer log base 10 is computed by first using one of the techniques above for finding the log base 2. By the relationship log10(v) = log2(v) / log2(10), we need to multiply it by 1/log2(10), which is approximately 1233/4096, or 1233 followed by a right shift of 12. Adding one is needed because the IntegerLogBase2 rounds down. Finally, since the value t is only an approximation that may be off by one, the exact value is found by subtracting the result of v < PowersOf10[t].

This method takes 6 more operations than IntegerLogBase2. It may be sped up (on machines with fast memory access) by modifying the log base 2 table-lookup method above so that the entries hold what is computed for t (that is, pre-add, -multiply, and -shift). Doing so would require a total of only 9 operations to find the log base 10, assuming 4 tables were used (one for each byte of v).
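
As a rough Python sketch of the quoted technique (the helper names int_log2 and int_log10 are mine, purely illustrative, not from any library):

POWERS_OF_10 = [1, 10, 100, 1000, 10**4, 10**5, 10**6, 10**7, 10**8, 10**9]

def int_log2(v):
    # floor(log2(v)): the index of the most significant set bit.
    return v.bit_length() - 1

def int_log10(v):
    # t approximates log10(v) as log2(v) * 1233/4096 ~ log2(v) / log2(10);
    # the +1 compensates for int_log2 rounding down, and the final
    # comparison fixes the possible off-by-one in the approximation.
    t = ((int_log2(v) + 1) * 1233) >> 12
    return t - (1 if v < POWERS_OF_10[t] else 0)

print(int_log2(1024), int_log10(999), int_log10(1000))  # -> 10 2 3

This only applies to positive integers - as Matteo Italia points out in the comments, the floating-point code paths are different.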

Of note: the use of De Bruijn sequence lookups and bit-shifting techniques to calculate log2 in this MIT video: Lec 2 | MIT 6.172 Performance Engineering of Software Systems, Fall 2010 (at the 36th minute mark onward).

Also of note: this Stack Overflow post, which demonstrates a method for making efficient log2 calculations truly cross-platform in C++.

Caveat: I have not checked the numpy source code to verify that it indeed implements similar techniques, but it would be surprising if it did not. In fact, from the comments under the OP's post, Fermion Portal has checked:

Actually numpy uses math.h from glibc, you'll see the same difference in C/C++ if you use math.h/cmath.h. You can find the nicely commented source code for the two functions e.g. ic.unicamp.br/~islene/2s2008-mo806/libc/sysdeps/ieee754/dbl-64/… and ic.unicamp.br/~islene/2s2008-mo806/libc/sysdeps/ieee754/dbl-64/… – Fermion Portal

Reblochon Masque
  • Careful: here we are talking about logarithms *on floating point numbers*; the "bit hacks" above do not apply. – Matteo Italia Nov 30 '15 at 07:23
  • Oh shoot! Argh! Just when I thought my contribution could be useful! O_O... I'll add a note that says it applies to integers; maybe it can still be useful to some? Thank you for pointing it out, Matteo - I learned something anyway! :) – Reblochon Masque Nov 30 '15 at 07:30

This is just a note, but too long for a comment. Apparently this has to do with your particular install:

import numpy as np
import numexpr as ne
x = np.random.rand(100000)

I get the same timings with numpy 1.10 from conda and a version compiled with icc:

%timeit np.log2(x)
1000 loops, best of 3: 1.24 ms per loop

%timeit np.log(x)
1000 loops, best of 3: 1.28 ms per loop

I thought it might have something to do with grabbing the MKL VML package, but it looks like that's a no:

%timeit ne.evaluate('log(x)')
1000 loops, best of 3: 218 µs per loop

It looks like your numpy install is grabbing its log and log2 implementations from two different places, which is odd.
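
(A hedged way to investigate this: numpy can report what it was built and linked against, which is the first place I'd look when two installs time the same function differently.)

import numpy as np

# Report the compilers and math/BLAS libraries this numpy build is linked
# against; installs whose output differs here (e.g. MKL vs. plain libm)
# can easily show different ufunc timings.
print(np.version.version)
np.show_config()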

Daniel
  • That's interesting - I'm still seeing very different timings for `np.log` and `np.log2` using numpy 1.10.1 or 1.9.3 from conda, although both of these seem to have been compiled using gcc 4.4.7 rather than icc. I don't have access to a version compiled with icc in order to test this further. – ali_m Nov 19 '15 at 22:17
  • I got hold of a copy of ICC 16.0.1 and built numpy 1.10.1 from source following the instructions [here](https://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl). I'm now seeing slightly worse overall performance, with `np.log` and `np.log10` still being about a factor of 2 slower than `np.log2` and `np.log1p`. – ali_m Nov 20 '15 at 17:24
  • @ali_m Even more curious. Are you by chance running on a AMD cpu? – Daniel Nov 21 '15 at 14:30
  • Nope - so far I've tried this on two Intel machines. What is your setup? – ali_m Nov 21 '15 at 15:00
  • @ali_m All intel machine also. Have you tried `conda`? – Daniel Nov 23 '15 at 16:44
  • Yes - as I mentioned above, I also tried numpy 1.10.1 and 1.9.3 installed via `conda`. I've also tried them both with and without the optional MKL optimizations. In every case I see at least a factor of 2 difference between `np.log2` and `np.log`. – ali_m Nov 23 '15 at 16:55

Disclaimer: I'm neither a credible nor an official source.

I'm almost certain that any implementation of the base-e log function can be made as fast as the log2 function, because converting one into the other requires a single division by a constant. This is of course assuming that a single division operation is a tiny fraction of the other computations, which is true in accurate implementations of logarithms.
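
For example, a minimal numpy check of that base-change identity:

import numpy as np

# log_b(x) = log_a(x) / log_a(b): any logarithm is any other logarithm
# times a fixed constant, so the extra cost is one multiply or divide.
x = np.random.rand(5) + 0.5
assert np.allclose(np.log(x), np.log2(x) * np.log(2.0))
assert np.allclose(np.log10(x), np.log2(x) / np.log2(10.0))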

In most cases, numpy uses math.h from glibc; you'll see the same difference in C/C++ if you use math.h/cmath.h. In the comments, some people observe the same speeds for np.log and np.log2; I suspect this may come from different builds/platforms.

You can find the nicely commented source code for the two functions in files e_log.c, e_log2.c, e_logf.c, e_log2f.c in the dbl-64/ and flt-32/ subdirectories of this GitHub repo.

For double precision in glibc, the log function implements a completely different algorithm (compared to log2): an IBM one from ~2001 that was included in their libultim library, whereas log2 comes from Sun Microsystems, ~1993. Just looking at the code, one can see that two different approximations are being implemented. In contrast, for single precision both functions, log and log2, are the same apart from a division by ln 2 in the log2 case, hence the same speed.

For even more background on the underlying algorithms, alternatives and discussions on which to include in glibc in the future, see here.

Fermion Portal
  • Thanks for digging into this. I'm going to award you the bounty, since I think those links have so far provided the most useful insight into the issue. However, I still hesitate to mark this question as closed in case someone else wants to step in with a more comprehensive answer. – ali_m Dec 01 '15 at 14:14

(Should probably be a comment but will be too long...)

To make this more interesting, in 2018 on a 64-bit Windows 10 machine the results are reversed.

Default Anaconda

Python 3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np; np.random.seed(0); x = np.random.rand(100000)
   ...: %timeit np.log2(x)
   ...: %timeit np.log1p(x)
   ...: %timeit np.log(x)
   ...: %timeit np.log10(x)
   ...:
1.48 ms ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.33 ms ± 36.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
840 µs ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
894 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Intel Python

Python 3.6.3 |Intel Corporation| (default, Oct 17 2017, 23:26:12) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np; np.random.seed(0); x = np.random.rand(100000)
   ...: %timeit np.log2(x)
   ...: %timeit np.log1p(x)
   ...: %timeit np.log(x)
   ...: %timeit np.log10(x)
   ...:
1.01 ms ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
236 µs ± 6.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
161 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
171 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Constructor