Why does this code beat rint() and how to I protect it from -ffast-math and friends?

Question

I am looking to find a way to protect some code work from -ffast-math (or msvc/icc equivalents, etc) that works across C compilers.

My inner loop is searching data for numbers that are close to integer values (e.g within ~0.1). Data values are signed, typically less than a few thousand with no inf/nan. The fastest version I found uses a trick with a large magic number:

 remainder = h - ((h+MAGIC)-MAGIC) ;

Does someone have ideas for a way to keep the priority order of the brackets for the key line above? This seems to beat rint(x) by a factor of 3, so I am kind of curious as to why it is working anyway. Could it it be something to do with vectorisation?

Most compilers "simplify" the expression when using -ffast-math or equivalent and it stops working. I want to keep the performance (3X is quite a lot) but also keep it vaguely portable (given MAGIC depends on having the right ieee). If I add a volatile then it slows down but seems to give right answers with fast-math but is then slower than rint:

 volatile t = h+MAGIC; t-=MAGIC;
 remainder = h - t;

A complete example is below. I tried some gcc things like __attribute__((optimize("-fno-associative-math"))) but it doesn't seem like the right approach to eventually work for icc/gcc/msvc/clang etc. The related C99 standard pragma's don't seem to be widely available either.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/time.h>

/* https://stackoverflow.com/questions/17035464/a-fast-method-to-round-a-double-to-a-32-bit-int-explained */
union i_cast {double d; int i[2];};
#define MAGIC 6755399441055744.0
/* x86_64 for me today */
#define ENDIANLOC 0
#define REMAINDER_LUA  \
    {volatile union i_cast u; u.d = (h) + MAGIC; r = h - u.i[ENDIANLOC]; }

#define REMAINDER_MAGIC  r=(h - ((h+MAGIC)-MAGIC));
#define REMAINDER_RINT   r=(h - rint(h));
#define REMAINDER_TRUNC  r=(h - ( (h>0) ? ((int)(h+0.5)) : ((int)(h-0.5))) );
#define REMAINDER_FLOOR  r=(h - floor(h+0.5));
#define REMAINDER_REMAIN r=(remainder(h, 1.0));
#define REMAINDER_ROUND  r=(h - round(h));
#define REMAINDER_NEARBY r=(h - nearbyint(h));

#define block(MACRO) {                                                  \
    for(i=0 ; i<3 ; i++){                                               \
      gettimeofday(&start, NULL);                                       \
      n = 0;                                                            \
      for (k = 0; k < ng; k++) {                                        \
        h = mul * gv[k];                                                \
        MACRO                                                           \
          if ( (r*r) < tol ) n++;                                       \
      }                                                                 \
      gettimeofday(&end, NULL);                                         \
      double dt = (double)(end.tv_sec - start.tv_sec);                  \
      dt += (1e-6)*(double)(end.tv_usec - start.tv_usec);               \
      if(i==2)                                                          \
      printf("%20s %d indexed in %lf s %f ns/value\n",#MACRO,           \
             n,dt,1e9*dt/ng);                                           \
    }                                                                   \
  } 

int main(){
  struct timeval start, end;
  // Make some test data
  double h, r, tol = 0.02, mul = 123.4;
  int i, n, k, ng = 1024*1024*32;
  srand(42);
  double *gv = (double *) malloc(ng*sizeof(double));
  for(int i=0;i<ng;i++) { gv[i] = ((double)rand())/RAND_MAX*2.-1.; }
  // Measure some timing
  block(REMAINDER_MAGIC);
  block(REMAINDER_LUA);
  block(REMAINDER_RINT);
  block(REMAINDER_FLOOR);
  block(REMAINDER_TRUNC);
  block(REMAINDER_ROUND);
  block(REMAINDER_REMAIN);
  block(REMAINDER_NEARBY);
  free( gv );
  return 0;
}

For me, today, the output was like this with gcc -O3:

     REMAINDER_MAGIC 9489537 indexed in 0.017953 s 0.535041 ns/value
       REMAINDER_LUA 9489537 indexed in 0.048870 s 1.456439 ns/value
      REMAINDER_RINT 9489537 indexed in 0.050894 s 1.516759 ns/value
     REMAINDER_FLOOR 9489537 indexed in 0.086768 s 2.585888 ns/value
     REMAINDER_TRUNC 9489537 indexed in 0.162564 s 4.844785 ns/value
     REMAINDER_ROUND 9489537 indexed in 0.417856 s 12.453079 ns/value
    REMAINDER_REMAIN 9489537 indexed in 0.517612 s 15.426040 ns/value
    REMAINDER_NEARBY 9489537 indexed in 0.786896 s 23.451328 ns/value

Perhaps some other language (rust/go/opencl/whatever) would do better than C here? Or it is just better to control the compiler flags and add a runtime test in the code for correctness?

https://www.thegeekstuff.com/2010/03/debug-c-program-using-gdb/ — 0___________, Jan 07 '20 at 16:13
@P__J__ What does this have to do with using a debugger? The OP is concerned with code generation for FP math, not with something behaving unexpectedly. — Sneftel, Jan 07 '20 at 16:14
Just FYI, your `REMAINDER_LUA` approach is not portable, it violates the strict aliasing rule and thus produces undefined behavior. — Patrick Roberts, Jan 07 '20 at 16:15
Probably worth noting, `rint` *can* be fast, if compiled with SSE4.1 or later (eg AVX) support (so that `vroundsd` is available). — harold, Jan 07 '20 at 16:22
@harold Thanks! rint does get the win for intel's icc with axv, but gcc does not seem to find this. — Jon, Jan 07 '20 at 17:15
on gcc I could eventually match most of the timings using the following flags. The trapping-math option is turning on/off vectorisation: -O1 -march=native -ftree-vectorize -fopt-info-vec -fno-trapping-math — Jon, Jan 07 '20 at 21:00

score 2 · Accepted Answer · answered Jan 07 '20 at 16:16

There's no standard way to control this nonstandard behavior. Every compiler with a -ffast-math-style option has attributes and pragmas to control that, but those differ, as do the precise effects of the option. For that matter, different versions of some of these compilers have remarkably different fast-math behavior, so it's not merely a matter of a collection of appropriate pragmas. The standard way to get standard behavior is to make the compiler follow the language standard.

-ffast-math and its ilk are intended primarily for programmers who don't care about the details of floating point math, and just want their programs (which make limited and conservative use of FP) to run faster. Most of the useful effects of -ffast-math can be duplicated with carefully written code in any case.

Why does this code beat rint() and how to I protect it from -ffast-math and friends?

1 Answers1