
I'm currently benchmarking several algorithms in C, and I noticed the following behavior that I cannot explain:

When comparing the execution times of the pow() and powf() functions from math.h, powf() is about two times slower than pow().

I called powf() with float values only and pow() with double values only, so there should not be any implicit type conversion.

I run the code on a BeagleBone Black and compile it with gcc. Currently, I do not use any optimization flags. With -O3, the execution times are nearly the same.

Is there an explanation for why powf() is so much slower?

Here is a minimal example of what I did:

#include <time.h>
#include <stdio.h>
#include <math.h>

struct timespec diff_time(struct timespec start, struct timespec end)
{
    struct timespec temp;
    if ((end.tv_nsec - start.tv_nsec) < 0) {
        temp.tv_sec = end.tv_sec - start.tv_sec - 1;
        temp.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;
    }
    else {
        temp.tv_sec = end.tv_sec - start.tv_sec;
        temp.tv_nsec = end.tv_nsec - start.tv_nsec;
    }
    return temp;
}

int main() {

    struct timespec time1, time2;
    double time_diff; 

    double result=0;
    float resultf=0;

    double value = 234.2348;
    float valuef = 234.2348f;

    int j;

    int select_switch = 1; 

    //TIC
    clock_gettime(CLOCK_REALTIME, &time1);

    
    if (select_switch == 1) {

        for (j = 0; j < 1000; j++)
        {
            result = pow(value, 2);
        }
    }

    if (select_switch == 2) {
        for (j = 0; j < 1000; j++)
        {
            resultf = powf(valuef, 2.0f);
        }
    }

    if (select_switch == 4) {
        for (j = 0; j < 1000; j++)
        {
            resultf = valuef * valuef;
        }
    }


    if (select_switch == 5) {
        for (j = 0; j < 1000; j++)
        {
            result = value * value;
        }
    }

    /* TOC */

    clock_gettime(CLOCK_REALTIME, &time2);
    time_diff = diff_time(time1, time2).tv_sec * (1e3) +
        (diff_time(time1, time2).tv_nsec) * (1e-6); // in milliseconds

    printf("%lf", result); 
    printf("%f", resultf); 
}
phuclv
Berni
  • Does this answer your question? [How is pow() calculated in C?](https://stackoverflow.com/questions/40824677/how-is-pow-calculated-in-c) – Robert Harvey Jul 27 '21 at 15:18
  • Can you post a [mre](https://stackoverflow.com/help/minimal-reproducible-example)? On x86 it is easy to forget about [`-ffloat-store`](https://gcc.gnu.org/wiki/FloatingPointMath) – malat Jul 27 '21 at 15:26
  • @Berni Post a sample of the code test harness. The description is insufficient. – chux - Reinstate Monica Jul 27 '21 at 16:45
  • Benchmarking without optimization is normally pointless, often introducing different bottlenecks from what you're trying to benchmark. However, `-O3` might optimize `pow(x, 2)` into a simple `x*x` because GCC defines `pow` (and probably `powf`) as builtin functions by default. Or it might have optimized away your benchmark if you didn't write it carefully. Without details we can't tell you exactly what's going on. – Peter Cordes Jul 27 '21 at 23:02
  • Thanks for your comments. I added a minimal example to show what I did. – Berni Jul 28 '21 at 08:17
  • Your inputs are all constants that will completely defeat your benchmark at `-O3`, and your repeat loop is only 1000 iterations so it's questionable even at `-O0`. Any results are probably noise or warm-up effects. ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)). Also, your test harness doesn't print the elapsed time. – Peter Cordes Jul 28 '21 at 08:17
  • Also, the format for printing a double is `%f`: float is implicitly promoted to double for variadic functions like printf. Most implementations do support `%lf` as a synonym that also takes a double, which is why your code works. – Peter Cordes Jul 28 '21 at 08:23
  • Your repeat loop also doesn't do anything to force re-computing of the result. You might use `-fno-builtin-pow` / `-fno-builtin-powf` to force GCC not to inline when optimization is enabled, which of course would make it slower with a simple constant arg like 2.0. Unless that's what you *want* to test. – Peter Cordes Jul 28 '21 at 08:25

1 Answer


Just a shot in the dark:

If the hardware implements only double, then float will be slower if conversion to/from the native double format isn't free as part of float-load and float-store instructions.

For reference:

malat
  • BeagleBone Black is ARM Cortex-A8 (https://beagleboard.org/black) with NEON for scalar/SIMD FP, so it definitely supports single-precision. – Peter Cordes Jul 27 '21 at 23:04