22

I'm trying to calculate log_a(b) (and get a floating-point value back, not an integer). I was planning to do this as log(b)/log(a). Mathematically speaking, I can use any of the cmath log functions (base 2, e, or 10) to do this calculation; however, I will be running this calculation a lot during my program, so I was wondering if one of them is significantly faster than the others (or, better yet, if there is a faster but still simple way to do this). If it matters, both a and b are integers.
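
Here is roughly what I have in mind (a sketch; the function name is just illustrative, and it assumes a > 1 and b > 0):

#include <cmath>

// log_a(b) via the change-of-base formula; both arguments are integers
double log_base(int a, int b)
{
    return std::log(static_cast<double>(b)) / std::log(static_cast<double>(a));
}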

user1803551
  • 12,965
  • 5
  • 47
  • 74
Nick
  • 2,821
  • 5
  • 30
  • 35
  • 6
    In the words of Donald Knuth: "*We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil*" – You Jul 11 '11 at 20:09
  • They were about the same when I checked. – Node Jul 11 '11 at 20:13
  • 18
    @You *"mindless mantras are anethema to productive thought"* – Tamás Szelei Jul 11 '11 at 20:15
  • 20
    @You - I've always felt that quote is over-utilized. Certainly there are circumstances where you can spend a lot of effort, reduce the readability of the code, and end up not noticing the difference. There are plenty of other cases where you can spend very little effort, not affect the readability at all, and make a huge improvement. Knowing which case is which comes with practice, unless you stop considering the prospect altogether. – Mark Ransom Jul 11 '11 at 20:18
  • @Mark: That is a good point, but in this case there's probably no need to worry about the efficiency of standard math functions. There rarely is, as they're likely to be implemented in an optimal way on whatever architecture you're compiling for. – You Jul 11 '11 at 20:45
  • 4
    @You: multiply, add, and subtract are all *much* faster than log, exp, and trig. Sqrt and divide are in between. (Intel Skylake has a *very* fast FP divide unit, but it's still [a factor of 8 worse throughput, and a factor of ~3 worse latency than FP mul](http://agner.org/optimize/). sqrt is only slightly slower). It's a lot faster to check a Euclidean distance as `(x^2+y^2) < maxdistance^2` instead of `sqrt(x^2+y^2) < maxdistance`, esp. if you're doing this check repeatedly (like in a Mandelbrot inner loop), or with integers. (x86 scalar integer division is slower than SIMD FP division.) – Peter Cordes Jan 31 '16 at 04:16
  • @PeterCordes: Yes, I know that. My point is (and was, five years ago when this was first posted) this: Starting with the most obvious expression makes your code more maintainable. If that turns out to be too slow (most of the time it won't, because it's not a critical path of your code or the compiler is able to optimize), apply suitable transformations which you also clearly document. **Measure first, _then_ worry about optimization.** – You Jan 31 '16 at 10:45
  • 1
    @You: I disagree with the philosophy of being totally performance-agnostic when first writing. I think both ways are more or less equally readable for the example I gave, which is why I chose it. I can usually think of a ton of ways to do something, and having some idea about which will probably perform better actually helps me pick between ones that are equally readable and short. I've always liked efficient stuff, and couldn't stop myself from thinking about it, so part of that is just my way of thinking, and wouldn't work well for everyone. – Peter Cordes Jan 31 '16 at 10:53
  • I'm not advocating sacrifices in maintainability or readability (until after you measure and decide to look for a speedup). I hate the argument that every performance choice comes with a sacrifice in readability, because it's often not true. **My point** was that it's not a question of trying to beat the math library implementations of functions, it's about **choosing the right pieces from that set of building blocks, and putting them together so you use the expensive ones less.** – Peter Cordes Jan 31 '16 at 10:55
  • By all means, if you can think of two obvious ways to do something and one of them involves fewer/simpler operations, then use that one. But don't go out of your way to optimize unless you have profiled and identified an actual issue; the compiler is probably more clever than you (especially for hardware-specific optimizations). – You Jan 31 '16 at 11:33
  • 11
    @You That is a partial and indeed selective quotation. I suggest you read the rest of it, particularly the final sentence. The full quotation is as follows: *'Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.'* – user207421 Jan 31 '16 at 23:21
  • @EJP: Yes, and you have to measure to know which 3% are critical and therefore warrant manual optimization. – You Feb 01 '16 at 06:31
  • 2
    I like [Linus Torvald's ideas on writing nice code](http://meta.slashdot.org/story/12/10/11/0030249/linus-torvalds-answers-your-questions). It's pretty similar to what I was saying. "*So there's lots of pride in doing the small details right. It may not be big and important code, but I do like seeing code where people really thought about the details, and clearly also were thinking about the compiler being able to generate efficient code (rather than hoping that the compiler is so smart that it can make efficient code* despite *the state of the original source code).*" – Peter Cordes Feb 01 '16 at 23:42

5 Answers

15

First, precalculate 1.0/log(a) and multiply each log(b) by that expression instead.

Edit: I originally said that the natural logarithm (base e) would be fastest, but others state that base 2 is supported directly by the processor and would be fastest. I have no reason to doubt it.

Edit 2: I originally assumed that a was a constant, but re-reading the question, that is never stated. If a isn't constant, there would be no benefit to precalculating. If it is, however, you can maintain readability with an appropriate choice of variable names:

const double base_a = 1.0 / log(a);      // precompute the constant factor once
for (int b = 1; b < bazillions; ++b) {   // start at 1: log(0) isn't defined
    double result = log(b) * base_a;     // log_a(b) with a multiply instead of a divide
}

Strangely enough Microsoft doesn't supply a base 2 log function, which explains why I was unfamiliar with it. Also the x86 instruction for calculating logs includes a multiplication automatically, and the constants required for the different bases are also available via an optimized instruction, so I'd expect the 3 different log functions to have identical timing (even base 2 would have to multiply by 1).
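
If your math library doesn't supply log2, a simple stand-in is to scale the natural log by 1/ln(2) (a sketch; the function name is mine):

#include <cmath>

// log2(x) = ln(x) / ln(2); multiplying by the precomputed constant 1/ln(2)
// avoids a division per call.
inline double log2_fallback(double x)
{
    return log(x) * 1.4426950408889634;
}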

RaveTheTadpole
  • 1,515
  • 13
  • 25
Mark Ransom
  • 299,747
  • 42
  • 398
  • 622
  • +1 Nice observation. Although the other answer has a point: since these operations are so fast anyway, optimizing them is a little much. – Jesus Ramos Jul 11 '11 at 20:11
  • @Jesus Ramos, I always give the question the benefit of the doubt. I've worked on pixel operations that are performed millions at a time when the user expects instant results, and it doesn't take much imagination to think that someone else might have similar constraints. – Mark Ransom Jul 11 '11 at 20:14
  • 1
    That's true, but optimization by newer compilers almost removes the need to do some of these things, which is kind of sad. Most people still write out optimizations like these, though. – Jesus Ramos Jul 11 '11 at 20:14
  • 1
    I'm pretty sure that the logarithm to base 2 is the fastest, since it's the only one to have its own assembler instruction (at least in the 8087 instruction set). – MartinStettner Jul 11 '11 at 20:15
  • @Jesus Ramos, I don't think that divides by a constant are often replaced with multiplies by the inverse, since the compiler can't guarantee that the lowest bits of the result would be identical. I think divides are still slower than multiplies, even on modern processors, and you would notice a difference. – Mark Ransom Jul 11 '11 at 20:54
  • I think what some compilers do (at least ICC does this for some instructions) is see the repeatedly used value and save it into memory, so if log(constant) is used a lot it gets calculated once and stored somewhere, so it's not always recalculated. – Jesus Ramos Jul 11 '11 at 20:55
  • 3
    @Mark: exactly correct. Unless a flag like `fast-math` is set, the compiler cannot replace division with a reciprocal multiply (not only because it changes rounding, but also because it can cause spurious overflows and NaNs). And yes, divide is *much* slower than multiplication on most current processors. – Stephen Canon Jul 11 '11 at 20:56
  • 2
    The log instructions are unlikely to be used by modern compilers, as they are inaccurate (relative to what you can do in software) and only available in the deprecated x87 instruction set. – zwol Jan 30 '16 at 14:19
11

Since b and a are integers, you can use all the glory of bit twiddling to find their logs to the base 2. Here are some:

  • Find the log base 2 of an integer with the MSB N set in O(N) operations (the obvious way)
  • Find the integer log base 2 of an integer with a 64-bit IEEE float
  • Find the log base 2 of an integer with a lookup table
  • Find the log base 2 of an N-bit integer in O(lg(N)) operations
  • Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup

I'll leave it to you to choose the best "fast-log" function for your needs.
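
For example, here's a sketch of the O(lg(N)) shift-and-compare method from the list above (the name ilog2 is mine; it returns floor(log2(v)) for v > 0):

#include <cstdint>

// floor(log2(v)) for v > 0, using O(lg(N)) shifts and compares
unsigned int ilog2(uint32_t v)
{
    unsigned int r = 0;
    if (v >= 1u << 16) { v >>= 16; r += 16; }
    if (v >= 1u << 8)  { v >>= 8;  r += 8;  }
    if (v >= 1u << 4)  { v >>= 4;  r += 4;  }
    if (v >= 1u << 2)  { v >>= 2;  r += 2;  }
    if (v >= 1u << 1)  {           r += 1;  }
    return r;
}

Keep in mind that combining two such integer logs only gives an integer approximation of log_a(b); for a fractional result you still need a floating-point log.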

Jacob
  • 34,255
  • 14
  • 110
  • 165
  • 4
    Cool link. I wonder if any of those methods are faster than the straightforward ones on modern processors? – Mark Ransom Jul 11 '11 at 20:24
  • 1
    @MarkRansom: They'll all be much slower than using a hardware instruction to count leading zeros, [which all modern architectures have](https://en.wikipedia.org/wiki/Find_first_set#Hardware_support), because the looping methods will branch-mispredict with different inputs. The hw insn is very cheap. x86's original `bsr` is clunky (undefined if input is 0), but does give the log2 result directly, instead of 32-log2(a). Only quite recent CPUs support `lzcnt`, which is defined for `0`. AVX512CD will introduce `VPLZCNTD`, which does it on every element in a vector of integers. – Peter Cordes Jan 31 '16 at 04:29
  • 1
    Also, I thought the OP was looking for a solution that didn't round / truncate intermediate results to integer. Sure the starting values are integer, but `integer_log2(uint32_t)` only has a range of 0..32, so the fractional part makes a *big* difference. There's a huge range of numbers between 2^30 and 2^31, but they all have the same ilog2. – Peter Cordes Jan 31 '16 at 04:33
4

On the platforms for which I have data, log2 is very slightly faster than the others, in line with my expectations. Note, however, that the difference is extremely slight (only a couple of percent). This is really not worth worrying about.

Write an implementation that is clear. Then measure the performance.
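
If you want to measure it yourself, here's a rough sketch of a timing loop (not a rigorous benchmark; the volatile sink just keeps the compiler from discarding the calls):

#include <chrono>
#include <cmath>
#include <cstdio>

int main()
{
    const int iterations = 10000000;
    volatile double sink = 0.0;                  // prevents the loop from being optimized away
    auto start = std::chrono::steady_clock::now();
    for (int i = 1; i <= iterations; ++i)
        sink = sink + std::log2(static_cast<double>(i));  // swap in log() or log10() to compare
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = stop - start;
    std::printf("%f seconds (sink = %f)\n", elapsed.count(), static_cast<double>(sink));
}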

Stephen Canon
  • 103,815
  • 19
  • 183
  • 269
1

In the 8087 instruction set, the only logarithm instruction is for base 2, so I would guess that one to be the fastest.

Of course, this kind of question depends largely on your processor/architecture, so I would suggest making a simple test and timing it.

MartinStettner
  • 28,719
  • 15
  • 79
  • 106
  • On modern x86 processors, it turns out that good software log implementations are faster than the hardware log instruction, so the base used in that instruction is really of no consequence to the question. – Stephen Canon Jul 11 '11 at 20:15
  • 2
    @Stephen Canon, if you have a link to some timing results this would be a great place to trot it out. – Mark Ransom Jul 11 '11 at 20:26
  • 7
    @Mark Ransom, the Intel Optimization Manual lists `fyl2x` as having a throughput of 1 result every 85 cycles, and a latency of between 140 and 190 cycles. By contrast, using the system math library on OSX on my MacBook Pro, I measure `log2( )` as having a latency of ~72 cycles and throughput of 1 result every 46 cycles. So the software implementation is approximately twice as fast here. – Stephen Canon Jul 11 '11 at 21:12
0

The answer is:

  • it depends
  • profile it

You don't even mention your CPU type, the variable type, the compiler flags, or the data layout. If you need to do lots of these in parallel, I'm sure there will be a SIMD option. Your compiler will optimize that as long as you use alignment and clear, simple loops (or valarray if you like archaic approaches).

Chances are, the Intel compiler has specific tricks for Intel processors in this area.

If you really wanted to, you could use CUDA and leverage the GPU.

I suppose, if you are unfortunate enough to lack these instruction sets, you could go down to the bit-fiddling level and write an algorithm that does a nice approximation. In that case, I can bet more than one apple pie that the base-2 log is going to be faster than any other base of log.
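
As an illustration of that direction, here's a rough sketch of an approximate base-2 log pulled straight out of the IEEE-754 bit pattern (accurate to within roughly 0.09, valid only for finite x > 0; the name is mine):

#include <cstdint>
#include <cstring>

// Approximate log2: the exponent field gives floor(log2(x)), and the mantissa
// bits are used as a crude linear estimate of the fractional part.
float approx_log2(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);                 // reinterpret the float's bits
    float exponent = static_cast<float>(static_cast<int>(bits >> 23) - 127);
    float mantissa = static_cast<float>(bits & 0x7FFFFFu) / 8388608.0f;  // in [0, 1)
    return exponent + mantissa;
}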

sehe
  • 374,641
  • 47
  • 450
  • 633