
I've noticed that math functions (such as ceil and round) take more CPU cycles after any Intel AVX intrinsic has run.

See the following example:

#include <stdio.h>
#include <math.h>
#include <immintrin.h>


static unsigned long int get_rdtsc(void)
{
        unsigned int a, d;
        asm volatile("rdtsc" : "=a" (a), "=d" (d));
        return (((unsigned long int)a) | (((unsigned long int)d) << 32));
}

#define NUM_ITERATIONS 10000000

void run_round()
{
    unsigned long int t1, t2, res, i;
    double d = 3.2;

    t1 = get_rdtsc();
    for (i = 0 ; i < NUM_ITERATIONS ; ++i) {
        res = round(d*i);
    }
    t2 = get_rdtsc();

    printf("round res %lu total cycles %lu CPI %lu\n", res, t2 - t1, (t2 - t1) / NUM_ITERATIONS);
}

int main ()
{
    __m256d a;

    run_round();

    a = _mm256_set1_pd(1);

    run_round();

    return 0;
}

Compile with: gcc -Wall -mavx foo.c -lm

The output is:

round res 31999997 total cycles 224725952 CPI 22

round res 31999997 total cycles 1900864520 CPI 190

Please advise.

Shafik Yaghmour
kayan4096
  • What is the target platform? Linux, OS X, something else? – Eric Postpischil Dec 12 '13 at 14:36
    Thanks @StephenCanon adding __asm("VZEROUPPER") after the avx call did the trick. However, I wonder, isn't this a bug in gcc - one call to an avx intrinsic adds a performance hit to all legacy library calls unless the YMM register is cleaned? – kayan4096 Dec 12 '13 at 15:27
    @kayan4096: I’m surprised that GCC doesn’t insert the `vzeroupper` for you; clang does that. – Stephen Canon Dec 12 '13 at 15:34
    Probably worth filing a bug report with the GCC devs to request they do this as well. – Stephen Canon Dec 12 '13 at 15:48
  • @StephenCanon, your comments here helped me solve the following question: http://stackoverflow.com/questions/21960229/unexpectedly-good-performance-with-openmp-parallel-for-loop. – Z boson Feb 23 '14 at 22:02
  • @StephenCanon, GCC does not do this by default like Clang, but it does have an option, `-mvzeroupper`, which will do this. – Z boson Jun 30 '14 at 12:27
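
For anyone landing here, the fix discussed in the comments above can be sketched with the `_mm256_zeroupper()` intrinsic from `<immintrin.h>` instead of raw inline asm. This is a minimal sketch, not the code from the question: the helper name `avx_sum4` is made up for illustration, and the `target("avx")` attribute is an assumption that lets it build with GCC or Clang without passing `-mavx`. As far as I understand, the slowdown comes from running legacy SSE code (such as the library's `round`) while the upper halves of the YMM registers are still in use, and `vzeroupper` clears them.

```c
#include <immintrin.h>

/* Hypothetical helper (not from the question): do a little 256-bit AVX
   work, then clear the upper YMM halves before returning to code that
   may use legacy (non-VEX) SSE instructions, e.g. libm's round(). */
__attribute__((target("avx")))
static double avx_sum4(double x)
{
    __m256d v = _mm256_set1_pd(x);   /* 256-bit op: upper YMM halves now in use */
    double out[4];

    _mm256_storeu_pd(out, v);        /* copy the four lanes to memory */
    _mm256_zeroupper();              /* emits vzeroupper */

    return out[0] + out[1] + out[2] + out[3];
}
```

In the question's program, the equivalent change would be to call `_mm256_zeroupper()` right after `a = _mm256_set1_pd(1);` and before the second `run_round()`. Depending on the compiler and version, this instruction may already be inserted automatically, so it is worth checking the disassembly.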

1 Answer


Disassemble the generated code.

My guess would be that there is additional register saving/restoring going on, or something like that.

unwind
  • There is no difference in the Apple GCC version; `run_round` is called in two places from `main`, so, within the routine, it executes identical instructions. I would expect the same in other GCC versions. – Eric Postpischil Dec 12 '13 at 14:56