1

I'm working on a problem where I need to return the fifth power of a digit 0-9 and I had thought I could speed up the program by switching

int pow(int a){return a*a*a*a*a;}

out for

int pow(int a){
    switch(a){
        case 0: return 0;
        case 1: return 1;
        case 2: return 32;
        case 3: return 243;
        case 4: return 1024;
        case 5: return 3125;
        case 6: return 7776;
        case 7: return 16807;
        case 8: return 32768;
        case 9: return 59049;
    }
    return 0;
}

but I realized that the program runs about 20% faster with the first function than with the second, even though the first performs five multiplications while the second only executes a single switch statement. Why is this?

jeremy south
  • 161
  • 1
  • 3
  • Did you look at the generated code? – melpomene Mar 02 '18 at 04:38
  • 2
    because your specific code compiled on your specific compiler on your specific computer with your specific compiler flags generates slower code. You haven't provided us enough information to say anything more than that. If you provide all the above information, then maybe someone can point out the explicit differences in the code that result in the speed difference, but not until then – xaxxon Mar 02 '18 at 04:40
  • Because the multiplication happens to be about 20% faster than the switch statement? Speed isn't based on the number of statements. – user253751 Mar 02 '18 at 04:43
  • 1
    Is this a real bottleneck? Also could be due to branch prediction failure? https://stackoverflow.com/a/11227902/1294207 – Fantastic Mr Fox Mar 02 '18 at 04:44
  • 3
    Fast lookup tables are usually built using arrays, not `switch`. Beware that the `switch` effectively does a range check, which you'll need to write yourself using an array for lookup. – Ben Voigt Mar 02 '18 at 05:00
  • Lookup tables help to avoid multiple comparisons.. But a jump is still needed, that's why the first approach is faster. Actually a compiler usually optimizes a switch into a lookup table by itself. – AndreyS Scherbakov Mar 02 '18 at 05:32
  • No solution involving branching will out-perform simple arithmetic. I don't know what you think is 'simple' about a switch statement. – user207421 Mar 02 '18 at 05:49
  • Unrelated: Possibly faster way to do this `int pow(int a){ static int lookup[] = { 0, 1, 32, ... , 59049 }; return lookup[a];}` Is it faster? Dunno. You'll have to test it. – user4581301 Mar 02 '18 at 06:36
  • @EJP if you or anyone would sanity check my benchmark, I'd appreciate it: stackoverflow.com/a/49065807/493106 – xaxxon Mar 02 '18 at 09:12
  • @user4581301 benchmark from my answer below but with your version included (well actually from an answer below that suggested the same thing...) - see the pow3 version: http://quick-bench.com/1VfMABWFk4RPnbkrnLg5f2hvdx4 – xaxxon Mar 02 '18 at 09:38

3 Answers

2

It's not as cut and dried as you make it out to be. Depending on your input, either can be faster. In this case, if you look up the same value repeatedly, the table lookup is faster; if you look up varying values, the multiplication is faster. My guess is that this is the branch predictor doing its job when the lookup sees a constant value each time.

[chart: benchmark results]

Ignore the fact that the "varying" ones are much higher - that's the cost of the modulus. Simply compare the leftmost two with each other and the next two with each other.

live benchmark link: http://quick-bench.com/uZLVxVIMxE21JTsHWJVN8Is-37I

The generated ASM being benchmarked is shown at that link in the bottom right.

int pow(int a){return a*a*a*a*a;}
int pow2(int a){
    switch(a){
        case 0: return 0;
        case 1: return 1;
        case 2: return 32;
        case 3: return 243;
        case 4: return 1024;
        case 5: return 3125;
        case 6: return 7776;
        case 7: return 16807;
        case 8: return 32768;
        case 9: return 59049;
    }
    return 0;
}



static void multiply_varying(benchmark::State& state) {
  // Code inside this loop is measured repeatedly
  volatile int i = 0;
  for (auto _ : state) {
    i = (i + 1) % 9;
    benchmark::DoNotOptimize(pow(i));
  }
}
// Register the function as a benchmark
BENCHMARK(multiply_varying);

static void lookup_varying(benchmark::State& state) {
  volatile int i = 5;
  for (auto _ : state) {
    i = (i + 1) % 9;
    benchmark::DoNotOptimize(pow2(i));
  }
}
BENCHMARK(lookup_varying);

static void multiply_constant(benchmark::State& state) {
  // Code inside this loop is measured repeatedly
  volatile int i = 5;
  for (auto _ : state) {
    benchmark::DoNotOptimize(pow(i));
  }
}
// Register the function as a benchmark
BENCHMARK(multiply_constant);


static void lookup_constant(benchmark::State& state) {
  volatile int i = 5;
  for (auto _ : state) {
    benchmark::DoNotOptimize(pow2(i));
  }
}
BENCHMARK(lookup_constant);

Edit: a slightly different benchmark has the lookup being faster in both cases: http://quick-bench.com/NRdzldykfQ8cQmGEn33FG0LMr2Q

xaxxon
  • 19,189
  • 5
  • 50
  • 80
0

Nothing surprising: in the first case (power calculation), the variable and the result stay in a processor cache between subsequent multiplications, so at most one (far) memory read is expected with the first option. Note that a memory access may be a dozen times slower than the multiplication itself.

If you use a switch, you need a memory read plus a control jump to a computed label (which often means another memory read of executable code), so this path takes more time at runtime.

P.S. I've added assembly code samples below.

With multiplications:

    movl    16(%rbp), %eax
    imull   16(%rbp), %eax
    imull   16(%rbp), %eax
    imull   16(%rbp), %eax
    imull   16(%rbp), %eax

With a switch (a jump to a calculated address):

    movl    16(%rbp), %eax
    leaq    0(,%rax,4), %rdx
    leaq    .L4(%rip), %rax
    movl    (%rdx,%rax), %eax
    movslq  %eax, %rdx
    leaq    .L4(%rip), %rax
    addq    %rdx, %rax
    jmp     *%rax
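
For anyone wanting to reproduce listings like these, a minimal source file will do (the filename `pow5.cpp` is mine; the code is also valid C). The `%rbp`-relative loads in the listings suggest they came from an unoptimized build, so compile with `-O0`:

```cpp
// Both versions from the question in one translation unit, so the
// generated assembly for each can be compared side by side.
// Produce the assembly with:  g++ -O0 -S pow5.cpp  (output in pow5.s)
// Compile as C (gcc -O0 -S pow5.c) to get plain, unmangled labels.

int pow_mul(int a) { return a * a * a * a * a; }

int pow_switch(int a) {
    switch (a) {
    case 0: return 0;
    case 1: return 1;
    case 2: return 32;
    case 3: return 243;
    case 4: return 1024;
    case 5: return 3125;
    case 6: return 7776;
    case 7: return 16807;
    case 8: return 32768;
    case 9: return 59049;
    }
    return 0;
}
```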
AndreyS Scherbakov
  • 2,674
  • 2
  • 20
  • 27
  • it's dangerous to speculate on the generated code. It's better to have the person asking the question actually provide sufficient code to reproduce the problem and answer from the actual data. – xaxxon Mar 02 '18 at 04:46
  • 1
    @xaxxon, This is a microarchitecture optimization. You cannot see caching in your generated code, you will see 5 accesses to the same logical address, exactly the same as in the C code. – AndreyS Scherbakov Mar 02 '18 at 04:48
  • Even if you are correct, it's best to make the user provide their code so that you're sure they're not doing something else that is impacting their timings (or that their timings are even correct) and that they aren't compiling with non-sensical compiler flags. If nothing else, if you want to answer this, you should make example code for it so that you can point to actual generated code in your answer to back it up. – xaxxon Mar 02 '18 at 04:53
  • @xaxxon I don't know what you're speaking of. He *has* provided his code. – user207421 Mar 02 '18 at 05:56
  • @EJP assembly code is probably meant - I've added one generated with gcc – AndreyS Scherbakov Mar 02 '18 at 06:39
  • In case you're wondering, I actually benchmarked the code instead of just speculating. My answer is here: https://stackoverflow.com/a/49065807/493106 – xaxxon Mar 02 '18 at 09:03
0

The switch statement is probably slower due to pipeline stalls when branching.

It would be more usual to use a table lookup than a switch statement, as follows:

int pow(int a){
  static const int pows[10] = { 0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049 };
  if (a < 0 || a > 9) return 0;
  return pows[a];
}

But that may also be slow due to the conditional range check.
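
If those two comparisons show up in profiles, one common variant (a sketch, not measured here; the name `pow_lut` is mine) casts to unsigned so that both bounds collapse into a single compare, since negative values wrap around to large unsigned numbers:

```cpp
int pow_lut(int a) {
    static const int pows[10] = { 0, 1, 32, 243, 1024,
                                  3125, 7776, 16807, 32768, 59049 };
    // A negative a wraps to a huge unsigned value, so this one
    // comparison rejects both a < 0 and a > 9.
    if (static_cast<unsigned>(a) > 9u) return 0;
    return pows[a];
}
```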

Ross Bencina
  • 3,822
  • 1
  • 19
  • 33
  • My benchmark but with this version included as "pow3": http://quick-bench.com/1VfMABWFk4RPnbkrnLg5f2hvdx4 – xaxxon Mar 02 '18 at 09:36