
[Short answer: Bad benchmarking methodology. You'd think I'd have figured that out by now.]

The problem is presented as "find a method for calculating x^y quickly, where x and y are positive integers". A typical "fast" algorithm looks like this:

public long fastPower(int x, int y) {
  // Replaced my code with the "better" version described below,
  // but this version isn't measurably faster than what I had before
  long base = x; // otherwise, we may overflow at x *= x.
  long result = y % 2 == 1 ? x : 1;
  while (y > 1) {
    base *= base;
    y >>= 1;
    if (y % 2 == 1) result *= base;
  }

  return result;
}

I wanted to see how much faster this is than, say, calling Math.pow(), or using the naive approach of multiplying x by itself y times, like this:

public long naivePower(int x, int y) {
  long result = 1;
  for (int i = 0; i < y; i++) {
    result *= x;
  }
  return result;
}

Edit: Okay, it's been pointed out to me (correctly) that my benchmarking code was not consuming the result, and that totally throws everything off. Once I started consuming the result, I'm still seeing the naive approach being about 25% faster than the "fast" approach.
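
To make that pitfall concrete, here is a minimal sketch (illustrative only, not the original benchmark; it reuses the fastPower method and the count variable from the code below) contrasting a loop whose result is discarded with one whose result is consumed:

// Result discarded: the JIT can prove the call has no observable effect
// and may remove it entirely, so the timed loop measures almost nothing.
for (int i = 0; i < count; i++) {
  fastPower(2, 29);
}

// Result consumed: accumulating and printing the total keeps the work observable.
long total = 0;
for (int i = 0; i < count; i++) {
  total += fastPower(2, 29);
}
System.out.println(total);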

Original text:

I was very surprised to find that the naive approach was 4x faster than the "fast" version, which was itself about 3x faster than the Math.pow() version.

My test uses 10,000,000 trials (and then 100 million, just to make absolutely sure the JIT has time to warm up), each using random values (to prevent the calls from being optimized away) in the ranges 2 <= x <= 3 and 25 <= y <= 29. I chose a narrow range of values that wouldn't produce a result greater than 2^63, biased toward larger exponents to try to give the "fast" version an advantage. I'm pre-generating 10,000 pseudorandom numbers to take that part of the code out of the timing.

I understand that for small exponents the naive version could be expected to be faster. The "fast" version has two branches instead of one and generally does about twice as many arithmetic/storage operations per iteration as the naive one - but I would still expect that, for large exponents, the fast approach would save about half the operations in the best case and do about the same in the worst case.
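
To make that operation-count argument concrete, here is a small illustrative helper (not part of the benchmark) that mirrors the two loops above and counts how many multiplications each approach performs for a given exponent:

// Counts multiplications only; comparisons, shifts and branches are ignored.
static int[] countMultiplications(int y) {
  int naive = y;                    // naivePower: one multiplication per iteration
  int fast = 0;
  while (y > 1) {                   // mirrors the loop in fastPower
    fast++;                         // base *= base
    y >>= 1;
    if (y % 2 == 1) fast++;         // result *= base
  }
  return new int[] { naive, fast }; // e.g. y = 30 gives {30, 8}
}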

Anyone have any idea why the naive approach would be so much faster than the "fast" version, even with data biased toward the "fast" version (i.e. larger exponents)? Does the extra branch in that code account for that much of a difference at runtime?

Benchmarking code (yes, I know I should use some framework for "official" benchmarks, but this is a toy problem) - updated to warm up and consume results:

PowerIf[] powers = new PowerIf[] {
  new EasyPower(), // just calls Math.pow() and casts the result to long
  new NaivePower(),
  new FastPower()
};

Random rand = new Random(0); // same seed for each run
int randCount = 10000;
int[] bases = new int[randCount];
int[] exponents = new int[randCount];
for (int i = 0; i < randCount; i++) {
  bases[i] = 2 + rand.nextInt(2);
  exponents[i] = 25 + rand.nextInt(5);
}

int count = 1000000000;

for (int trial = 0; trial < powers.length; trial++) {
  long total = 0;
  for (int i = 0; i < count; i++) { // warm up
    final int x = bases[i % randCount];
    final int y = exponents[i % randCount];
    total += powers[trial].power(x, y);
  }
  long start = System.currentTimeMillis();
  for (int i = 0; i < count; i++) {
    final int x = bases[i % randCount];
    final int y = exponents[i % randCount];
    total += powers[trial].power(x, y);
  }
  long end = System.currentTimeMillis();
  System.out.printf("%25s: %d ms%n", powers[trial].toString(), (end - start)); 
  System.out.println(total);
}

Produces output:

                EasyPower: 7908 ms
-407261252961037760
               NaivePower: 1993 ms
-407261252961037760
                FastPower: 2394 ms
-407261252961037760

Playing with the parameters for the random numbers and the trials does change the output characteristics, but the ratios between the tests are always consistently the same as those shown.

Brian
  • Sure about which operation the `if (odd)` controls? – greybeard Feb 27 '16 at 05:51
  • Fast algorithms for exponentiation aren't worth much for machine integer arithmetic. Only if the cost of a multiplication is significantly bigger than that of the decrements or shifts will "fast" become better. Try again using BigInteger. – laune Feb 27 '16 at 05:54
  • I cannot reproduce these results, for me the fast one is actually faster by about 40%. Can you please post the code you used to benchmark that? I suspect that there is some flaw in there. Benchmarking code on the JVM can be tricky... – MartinS Feb 27 '16 at 05:54
  • @Brian which inputs did you test this for, and what timing results did you get for those? – Mike 'Pomax' Kamermans Feb 27 '16 at 05:56
  • A coarse (naive) comparison of machine ops and memory access for x=2 and y=48 shows Fast:34 46, Naive: 144 192. Which is about the O(log) you can expect. Java benchmarking is difficult, but we should see how you did it. – laune Feb 27 '16 at 05:58
  • @greybeard, that is actually if(even). – Brian Feb 27 '16 at 16:58
  • The fact that you guys are getting the expected results does point to a flaw in my testing methodology, but I can't see it yet. And probably not worth wasting your time with debugging the testing framework. Thanks. – Brian Feb 27 '16 at 16:59
  • `that is actually if(even)` - think about it. Or consult an implementation you like, or wikipedia. – greybeard Feb 27 '16 at 17:08
  • @greybeard - (foo % 2 == 0) is asking if when you divide foo by two, the remainder is zero - that is true for EVEN numbers. I appreciate your time and feedback. – Brian Feb 27 '16 at 17:14
  • Why, thanks - after all those decades, I might have forgotten. Regardless, what I seem to be hinting at too feebly is that you didn't present code doing the usual implementation: watch it using whatever process visualisation you prefer, check the one [dymo414](http://stackoverflow.com/a/35666472/3789665) links in his answer, or consult wikipedia on [exponentiating by squaring](https://en.wikipedia.org/wiki/Exponentiation_by_squaring). – greybeard Feb 27 '16 at 17:26
  • I'm not trying to find the "most optimal" solution - I'm trying to test this particular implementation. You should agree though, that my method (which I have verified produces the correct results) should require many fewer iterations than the naive version (O(logN) vs O(N)) where N is say 30. – Brian Feb 27 '16 at 17:38
  • Sorry, the code I presented had a bug now duplicated here: `x`/`base` better be `long` (if you don't change it in 48h time, I will). – greybeard Feb 29 '16 at 10:33
  • The best explanation I can offer for the unexpected good runtime of `NaivePower` in spite of doing way more multiplications is that it profits most from unrolling/avoidance of bubbles in the multiplier pipeline: see conditional multiplication vs. conditional multiplicand in guava's `LongMath.pow()`. – greybeard Feb 29 '16 at 10:38
  • @greybeard - Thanks for reminding me about the long base - I had made that fix in my local version, and forgot to edit it here. Regarding the multiplication pipeline, that makes perfect sense. God I love JIT voodoo! – Brian Mar 01 '16 at 15:44

4 Answers


There are two issues with your fastPower:

  1. It's better to replace y % 2 == 0 with (y & 1) == 0; bitwise operations are faster.
  2. Your code always decrements y and performs an extra multiplication, even when y is even. It's better to put this part into an else clause.

Anyway, I guess that your benchmarking method is not perfect. A 4x performance difference sounds weird and cannot be explained without seeing the complete code.

After applying the above improvements, I've verified with a JMH benchmark that fastPower is indeed faster than naivePower, by a factor of 1.3x to 2x.

package bench;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class FastPow {
    @Param("3")
    int x;
    @Param({"25", "28", "31", "32"})
    int y;

    @Benchmark
    public long fast() {
        return fastPower(x, y);
    }

    @Benchmark
    public long naive() {
        return naivePower(x, y);
    }

    public static long fastPower(long x, int y) {
        long result = 1;
        while (y > 0) {
            if ((y & 1) == 0) {
                x *= x;
                y >>>= 1;
            } else {
                result *= x;
                y--;
            }
        }
        return result;
    }

    public static long naivePower(long x, int y) {
        long result = 1;
        for (int i = 0; i < y; i++) {
            result *= x;
        }
        return result;
    }
}

Results:

Benchmark      (x)  (y)   Mode  Cnt    Score   Error   Units
FastPow.fast     3   25  thrpt   10  103,406 ± 0,664  ops/us
FastPow.fast     3   28  thrpt   10  103,520 ± 0,351  ops/us
FastPow.fast     3   31  thrpt   10   85,390 ± 0,286  ops/us
FastPow.fast     3   32  thrpt   10  115,868 ± 0,294  ops/us
FastPow.naive    3   25  thrpt   10   76,331 ± 0,660  ops/us
FastPow.naive    3   28  thrpt   10   69,527 ± 0,464  ops/us
FastPow.naive    3   31  thrpt   10   54,407 ± 0,231  ops/us
FastPow.naive    3   32  thrpt   10   56,127 ± 0,207  ops/us

Note: integer multiplication is quite a fast operation, sometimes even faster than an extra comparison. Don't expect huge performance improvements for values that fit into a long. The advantage of the fast power algorithm becomes evident with BigInteger and larger exponents.
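
To illustrate that last point, here is a rough sketch (my own, not part of the benchmark above) of both strategies on BigInteger, where each multiplication costs far more than a branch, so saving multiplications actually pays off:

import java.math.BigInteger;

public class BigIntPow {
    // Naive: y multiplications of ever-growing numbers.
    static BigInteger naivePower(BigInteger x, int y) {
        BigInteger result = BigInteger.ONE;
        for (int i = 0; i < y; i++) {
            result = result.multiply(x);
        }
        return result;
    }

    // Square-and-multiply: about log2(y) squarings plus one multiply per set bit of y.
    static BigInteger fastPower(BigInteger x, int y) {
        BigInteger result = BigInteger.ONE;
        BigInteger base = x;
        while (y > 0) {
            if ((y & 1) == 1) result = result.multiply(base);
            base = base.multiply(base);
            y >>>= 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // BigInteger ships its own pow(); this only illustrates the two algorithms.
        BigInteger expected = BigInteger.valueOf(3).pow(1000);
        System.out.println(fastPower(BigInteger.valueOf(3), 1000).equals(expected));  // true
        System.out.println(naivePower(BigInteger.valueOf(3), 1000).equals(expected)); // true
    }
}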

Update

Since the author has posted the benchmark, I must admit that the surprising performance results come from common benchmarking pitfalls. I've improved the benchmark while retaining the original methodology, and now it shows that FastPower is indeed faster than NaivePower; see here.

What are the key changes in the improved version?

  1. Different algorithms should be tested separately in different JVM instances to prevent profile pollution.
  2. A benchmark must be called several times to allow proper compilation/recompilation until the steady state is reached.
  3. One benchmark trial should be placed in a separate method to avoid on-stack replacement issues.
  4. y % 2 is replaced with y & 1 since HotSpot does not perform this optimization automatically.
  5. Minimized the effect of unrelated operations in the main benchmark loop.

Writing microbenchmarks manually is a difficult task. That's why it is strongly recommended to use a proper benchmarking framework like JMH.
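
For those who still want to hand-roll a benchmark, a rough skeleton reflecting points 2 and 3 above could look like this (illustrative only; the class and method names are mine, and JMH remains the better option):

public class ManualBench {
    // The timed loop lives in its own method, so the measured code is the
    // normally compiled version rather than an OSR'ed one, and every result
    // is consumed by accumulating it into the returned total.
    static long runOnce(int iterations, int[] bases, int[] exponents) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += fastPower(bases[i % bases.length], exponents[i % exponents.length]);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] bases = {2, 3, 2, 3};
        int[] exponents = {25, 27, 28, 29};
        // Repeat the trial until timings reach a steady state; run each
        // algorithm in its own JVM instance (point 1) to avoid profile pollution.
        for (int round = 0; round < 10; round++) {
            long start = System.nanoTime();
            long total = runOnce(10_000_000, bases, exponents);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("round %d: %d ms (total=%d)%n", round, elapsedMs, total);
        }
    }

    static long fastPower(long x, int y) {  // same as in the benchmark above
        long result = 1;
        while (y > 0) {
            if ((y & 1) == 0) { x *= x; y >>>= 1; }
            else { result *= x; y--; }
        }
        return result;
    }
}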

apangin
  • You're right about the else thing. The bitwise and vs the mod doesn't matter - Java optimizes that out anyway. Neither change impacted the results of my test. This is a toy problem, specifically for native ints, so using BigInteger defeats the purpose. – Brian Feb 27 '16 at 17:02
  • @Brian The assembly dump and JMH benchmark results prove that `&` **does** produce better code than `%`. – apangin Feb 27 '16 at 17:10
  • @Brian *'Neither change impacted the results of my test'* - this also indicates that your benchmark measures not what you expect it to measure. – apangin Feb 27 '16 at 17:14
  • Perhaps, but making that change had no measurable effect on the runtime in my benchmark. Meaning that the runtime is dominated by some other operation. – Brian Feb 27 '16 at 17:15
  • I've posted my benchmarking code - for the life of me, I don't see anything obviously wrong with it. Maybe you see what I am missing? – Brian Feb 27 '16 at 17:15
  • @Brian From my experience, `(x%2)==0` is *not* optimized to `(x&1)==0` and that *does* make a difference. C compilers usually do, Java does not. How did you verify that Java does optimize this? – Has QUIT--Anony-Mousse Feb 27 '16 at 17:18
  • @Brian Yes, there is a number of common mistakes in your benchmark. 1. It measures OSR'ed code; 2. It does different measurements in a single JVM, even in a single loop. 3. The function results are not *consumed*. 4. There is no warm-up, etc. See [How do I write a correct micro-benchmark in Java?](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) – apangin Feb 27 '16 at 17:22
  • @Anony-Mousse for Java there is a difference between compile time and runtime optimization. That javac does not optimize it does not mean that the JIT does not. I am pretty sure the JIT will optimize it here. – MartinS Feb 27 '16 at 17:22
  • I was assuming that (I know I know) - but empirically, the runtime did not change in a statistically significant way when I made this change, which means my runtime is being dominated by some other operation. And my goal here isn't to find the fastest implementation - it's to figure out why my test is showing the naive implementation to be 4x faster. – Brian Feb 27 '16 at 17:24
  • @apangin - OSR v JIT shouldn't matter with so many millions of iterations. Warm up made no difference - just ran a test with that. But you're absolutely right about forgetting to consume the output. Duh. But! Even after making those changes, I'm still seeing the naive function running about 25% faster than the "fast" version. – Brian Feb 27 '16 at 17:30
  • @Brian That's not true. OSRed code due to HotSpot compiler nature is generated differently from the code compiled from scratch. Try also leaving `FastPower` alone in `powers` array; its time will likely change. – apangin Feb 27 '16 at 17:59
  • @apangin - I just tried having FastPower alone in the array, and it's time changed only slightly. I also tried it for the naive version. And the naive version is still slightly faster. So anyway, how do you recommend I deal with the OSRed code, if you think that really makes a big difference in my benchmark. Thanks! – Brian Feb 27 '16 at 18:04
  • @Brian I've improved your benchmark while retaining the original methodology. [See here](https://gist.github.com/apangin/91c07684635893e3f1d5). Now it shows that `FastPower` is indeed faster. BTW, what Java version did you use? I tested on JDK 8u60 x64. – apangin Feb 27 '16 at 18:37
  • @apangin can you explain why the original code shows this behavior? – MartinS Feb 27 '16 at 19:40
  • @apangin Very cool - I actually tried running them individually in separate VMs, but did not observe a significant difference. I am running with Java 1.7.0_95 (OpenJDK Runtime Environment (IcedTea 2.6.4)). Aside from your code being cleaner than mine, I don't see any functional difference... do you think the difference in outcome between our runs is entirely due to the structural differences you made? Or are your results different because of a better JVM? – Brian Feb 27 '16 at 20:10
  • @apangin - I just downloaded and ran your code on my system (should have done that before, sorry). And I get results that mirror yours. So... why do the structural differences between my code and your code make such a huge difference? (And recall that this was running with my array having only one instance in it.) – Brian Feb 27 '16 at 20:16
  • @Brian @MartinS Yes, structural changes make a difference. Remember that the JIT compilation unit is a method. Never put all the benchmarking code in a single `main` method. I've updated the answer regarding the benchmarking issues. – apangin Feb 27 '16 at 21:05
  • @apangin - Thanks so much! FYI - The mod vs bitwise and did NOT affect the outcome for me, using your code (I tried it both ways). – Brian Feb 27 '16 at 21:38
  • @MartinS yes, and you cannot afford all optimizations at runtime in hotspot. C compilation takes a lot longer than JIT for a reason. My benchmarks says Hotspot does not perform this optimization at runtime either. – Has QUIT--Anony-Mousse Feb 27 '16 at 22:48
  • @Anony-Mousse when looking at the assembly code generated by the JIT I could see that it does optimize the `% 2` but there might be differences between the client VM and the server VM and different versions and god knows what. – MartinS Feb 28 '16 at 23:25

Without being able to review and replicate your benchmark, there's little point in trying to break down your results. They could be due to a poor selection of inputs, erroneous benchmarking practices such as running one test before another (thus giving the JVM time to "warm up"), and so on. Please share your benchmarking code, not just your results.

I would suggest including Guava's LongMath.pow() (src) in your tests; it is a heavily used and well-benchmarked method. While you might be able to beat it for certain inputs, it's unlikely you can improve on its runtime in the general case (and if you can, they'd love to hear about it).
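
For reference, a minimal usage sketch of that method (assuming the Guava dependency is on the classpath; the class name here is mine):

import com.google.common.math.LongMath;

public class GuavaPowDemo {
    public static void main(String[] args) {
        // LongMath.pow() wraps silently on overflow, like plain long multiplication.
        System.out.println(LongMath.pow(3, 29));
        // checkedPow() throws ArithmeticException if the result would overflow a long.
        System.out.println(LongMath.checkedPow(2, 62));
    }
}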

It is unsurprising that Math.pow() would perform worse than positive-integer-only algorithms. Looking at the "fast" vs. "naive" implementations, the outcome clearly depends very much on the inputs you choose, as Mike 'Pomax' Kamermans suggests. For small values of y the "naive" solution obviously has less work to do, but for larger values the "fast" implementation saves a good number of iterations.

dimo414
  • small detail to add: `Math.pow()` has to convert the input to `double` first which comes with a certain cost. – MartinS Feb 27 '16 at 06:28

In my eyes, the first fastPower(base, exponent) from the question was wrong, even if it did not give erroneous results. (The first version of intPower() below was buggy, as in giving wrong results, in addition to producing slightly misleading benchmark results.)
Due to the limited "formatting capabilities" of comments, here is another rendition of exponentiation by squaring to argue about, posted as an answer:

static public long intPower(int base, int exponent) {
    if (0 == base
        || 1 == base)
        return base;
    int y = exponent;
    if (y <= 0)
        // y % 2 != 0 also catches negative odd exponents (Java's % yields -1 there)
        return 0 == y ? 1 : -1 != base ? 0 : y % 2 != 0 ? -1 : 1;
    long result = y % 2 == 1 ? base : 1,
        power = base;
    while (1 < y) {
        power *= power;
        y >>= 1; // easier to see termination after Type.SIZE iterations
        if (y % 2 == 1)
            result *= power;
    }
    return result;
}

If you do microbenchmarks (what is a typical use of integer exponentiation?), do an appropriate warm-up, if need be by using a framework. Never invest time in microbenchmark results from timing runs taking less than 5 seconds per alternative.

One alternative derived from guava's LongMath.pow(b, e):

public long power(int base, int k) {
    for (long accum = 1, b = base ;; k >>>= 1)
        switch (k) {
        case 0:
            return accum;
        case 1:
            return accum * b;
        default:
            if ((k&1) != 0) // guava uses conditional multiplicand
                accum *= b;
            b *= b;
        }
}
greybeard
  • It's wrong in that it's not optimal - I should have had an else block. But it did still produce the correct result. Unfortunately, even after fixing that, and changing the even check to use & instead of %, the results are exactly the same. So either it's the branching that's slow, or there's something seriously wrong with my testing methodology. – Brian Feb 27 '16 at 17:00
  • Okay, your implementation is definitely cooler than mine. So I changed mine to match (though I don't care about non-positive values). But it only marginally increased the performance vs mine, and it's still slower than the naive implementation. – Brian Feb 27 '16 at 17:43

The while loop runs log2(y) times, while the for loop runs y times, so depending on your input, one will run faster than the other.

The while loop, worst case, runs:

  1. a compare (while conditional)
  2. a modulo,
  3. a compare, and in the worst case four more ops:
  4. a multiplication
  5. a bit shift assignment
  6. another multiplication, and finally,
  7. a decrement.

Whereas the naive for loop runs:

  1. a compare (for conditional),
  2. a multiplication, and
  3. an increment (for iterator)

so you'd expect the naive loop to be faster for small values of y, because the smaller number of operations per iteration in the for loop outweighs the log2 reduction of the "fast" approach, as long as the time lost on those extra operations is more than the time gained by log2-reducing y.

Mike 'Pomax' Kamermans
  • "Both run y times" just isn't true. -1, – laune Feb 27 '16 at 05:51
  • look at the `y >>= 1`, the fast one does not run y times – MartinS Feb 27 '16 at 05:51
  • And what does the `y>>=1 `do? – laune Feb 27 '16 at 05:52
  • @laune right shift and assign, so as long as y really is integer, it's the same as `y /= 2` – Mike 'Pomax' Kamermans Feb 27 '16 at 05:54
  • (Part of the problem is this "analysis of fast" uses the erroneous implementation of _fast_ from the question.) – greybeard Feb 27 '16 at 05:57
  • I'll happily delete this answer once a better one comes along. – Mike 'Pomax' Kamermans Feb 27 '16 at 05:58
  • `depending on [circumstances], one [alternative] will run faster than the other` - that's got to be the quintessential comparative run-time analysis. – greybeard Feb 27 '16 at 05:59
  • I'm doing *zero* true runtime analysis here, someone else can do this. The compiled code can get optimized to different instructions, and until someone else is willing to sit down and run the numbers for Brian (which he should be doing himself), a superficial op count with the observation that one runs O(log2(n)) but with more ops, and the other O(n) with fewer ops, seems an appropriate level of analysis here. – Mike 'Pomax' Kamermans Feb 27 '16 at 06:01
  • I've decompiled my own code, thank you very much, and what I see is that the "fast" version has about twice as many arithmetic instructions and branches as the naive version. But, for an exponent of say 30, I expect the "fast" version to only go through those statements 5 times, vs 30 for the naive version. So for y=30, I expect the "fast" version to be two to three times faster... but I don't see any reason for it to be 25% slower for large y. – Brian Feb 27 '16 at 17:56
  • where's your test code in your post that shows how you're comparing these methods, and the test result data for `y` ranging from 1 to, say, 1000? over random integer `x`? I only see a single instance comparison: that's not reliable data, that's *always* a fluke. – Mike 'Pomax' Kamermans Feb 27 '16 at 18:39
  • @Mike'Pomax'Kamermans - y over [1,1000] wouldn't make any sense for calculations that produce a result as a 64 bit long. I've updated my description to give a better idea of what I'm doing. – Brian Feb 27 '16 at 20:13
  • On tasks like this, theoretical runtime analysis is useless. **Constant factors cannot be neglected.** And even CPU cycle counting won't help, because we have a virtual machine in here. – Has QUIT--Anony-Mousse Feb 27 '16 at 22:51
  • It's absolutely not useless: it gives a cursory, absolutely not accurate, but informative insight in what *might* be going on, which is a category of Stackoverflow answers: if you have the answer, post it, but if you simply have information that'll help the person asking the question along in their quest for possible in-routes to an answer, those should 100% be given too, because they're helpful. This answer all but literally explains that constant factors cannot be neglected here already, and a proper analysis involved analysing the JVM bytecode. – Mike 'Pomax' Kamermans Feb 29 '16 at 06:11