Why does this functor perform better than plain code when optimized?

Question

Continuing my quest to optimize finite-difference code… I've managed to write a generalized algorithm for summing adjacent cell differences using multiline macros. Previously I used functors, but the performance was dismal.

// T0, T: double[T_size_x*T_size_y*T_size_z]
template <bool periodic_z, bool periodic_y, bool periodic_x>
void process_T(double *T0, double *T, const int &T_size_x, const int &T_size_y, const int &T_size_z) {
  double sum, base;
  const int dy = (T_size_y-1)*T_size_x;
  const int dz = (T_size_z-1)*T_size_x*T_size_y;
  int pos = 0; //T_size_x * j;
  struct _Diff {
    inline void operator() (const int &pos) { sum += T0[pos] - base; }
    inline void edge(const int &pos) { sum += 0.5*(T0[pos] -base); }
    _Diff (double &s, double &b, double *t0) : sum(s), base(b), T0(t0) {}
  private:
    double &sum;
    double &base;
    double *T0;
  } Diff (sum, base, T0);
  #define BigOne(i, periodic, T_size, left, right, start, end) \
  if (i > 0) Diff(left); \
  if (i < T_size-1) Diff(right); \
  if (!periodic) { \
    if (i == T_size-1 || i == 1) Diff.edge(left); \
    if (i == T_size-2 || i == 0) Diff.edge(right); \
  } else { \
    if (i == 0) Diff(end); \
    if (i == T_size-1) Diff(start); \
  }   
  for (int k = 0; k < T_size_z; ++k) {
    for (int j = 0; j < T_size_y; ++j) {
      for (int i = 0; i < T_size_x; ++i, ++pos) {
        sum = 0;
        base = T0[pos];
        // process x direction
        BigOne(i, periodic_x, T_size_x, pos-1, pos+1, pos - i, pos + T_size_x-1)
        // process y direction
        BigOne(j, periodic_y, T_size_y, pos-T_size_x, pos+T_size_x, pos - dy, pos + dy)
        // process z direction
        BigOne(k, periodic_z, T_size_z, pos-T_size_x*T_size_y, pos+T_size_x*T_size_y, pos - dz, pos + dz)
        T[pos] = T0[pos] + sum * 0.08; // where 0.08 is some magic number
      }
    }
  }
}

To make the code more performant, I tried converting the functor into macros as well:

#define Diff(pos) sum += T0[pos] - base
#define Diff_edge(pos) sum += 0.5*(T0[pos]-base)
// Note: change Diff.edge in BigOne to Diff_edge as well
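To check that the two variants really compute the same thing before comparing their speed, a reduced 1-D, non-periodic version can be run side by side. This is only a sketch under assumptions: the names DiffFunctor, step_functor, step_macro, DIFF and DIFF_EDGE are hypothetical and not from the original code, and the input is assumed to hold at least two cells.

```cpp
#include <cstddef>
#include <vector>

// Functor form, modelled on the question's local struct. The name DiffFunctor
// is hypothetical (the original name _Diff is a reserved identifier in C++).
struct DiffFunctor {
    DiffFunctor(double &s, double &b, const double *t0) : sum(s), base(b), T0(t0) {}
    void operator()(int pos) const { sum += T0[pos] - base; }
    void edge(int pos) const { sum += 0.5 * (T0[pos] - base); }
private:
    double &sum;
    double &base;
    const double *T0;
};

// One non-periodic 1-D step using the functor. Assumes in.size() >= 2.
std::vector<double> step_functor(const std::vector<double> &in) {
    const int n = static_cast<int>(in.size());
    std::vector<double> out(in.size());
    double sum = 0.0, base = 0.0;
    DiffFunctor diff(sum, base, in.data());
    for (int i = 0; i < n; ++i) {
        sum = 0.0;
        base = in[i];
        if (i > 0) diff(i - 1);
        if (i < n - 1) diff(i + 1);
        if (i == n - 1 || i == 1) diff.edge(i - 1);  // extra half-weight near the edges
        if (i == n - 2 || i == 0) diff.edge(i + 1);
        out[i] = in[i] + sum * 0.08;
    }
    return out;
}

// Macro form of the same two operations (hypothetical upper-case names).
#define DIFF(pos) (sum += T0[(pos)] - base)
#define DIFF_EDGE(pos) (sum += 0.5 * (T0[(pos)] - base))

// The same step with the macros expanded in place. Assumes in.size() >= 2.
std::vector<double> step_macro(const std::vector<double> &in) {
    const int n = static_cast<int>(in.size());
    std::vector<double> out(in.size());
    const double *T0 = in.data();
    double sum = 0.0, base = 0.0;
    for (int i = 0; i < n; ++i) {
        sum = 0.0;
        base = T0[i];
        if (i > 0) DIFF(i - 1);
        if (i < n - 1) DIFF(i + 1);
        if (i == n - 1 || i == 1) DIFF_EDGE(i - 1);
        if (i == n - 2 || i == 0) DIFF_EDGE(i + 1);
        out[i] = T0[i] + sum * 0.08;
    }
    return out;
}

#undef DIFF
#undef DIFF_EDGE
```

If both functions return the same values for the same input, any timing difference between the variants comes from code generation rather than from a change in behaviour.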

When compiled with g++ 4.8.2 without optimizations, the macro version runs faster by a wide margin, as expected. But when I compile with -O2 or -O3, the first (functor) version suddenly comes out ahead: over 15000 iterations, the functor version finishes in 22.7 s and the macro version in 23.7 s.

Why is this happening? Does the functor somehow serve as a hint to the compiler to cache instructions?
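For reference, timings like the ones above can be collected with a small harness along these lines. This is a sketch, not the code that was actually used for the measurements: the helper name time_iterations is hypothetical, and the callable passed in is assumed to perform one full update step.

```cpp
#include <chrono>

// Runs `step` `iterations` times and returns the elapsed wall-clock time in
// seconds. steady_clock is used because it is monotonic.
template <typename F>
double time_iterations(F &&step, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it) {
        step();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A call might look like `time_iterations([&] { process_T<false, false, false>(T0, T, nx, ny, nz); }, 15000)`, where nx, ny and nz are placeholder size variables. Since the gap reported here is only about 4%, repeating the measurement several times for each variant helps separate a real difference from run-to-run noise.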

Try looking at the generated assembly, and/or run both versions in Cachegrind (http://valgrind.org/info/tools.html) if you suspect instruction cache effects — Peter, Oct 29 '13 at 17:02
The difference seems to be due to branch misses. I haven't yet understood how the functor code reduces them. — syockit, Oct 29 '13 at 18:23

0 Answers