
I haven't been using OpenMP for long, and I am having trouble optimizing this code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define SIZE 100000000

typedef struct {
  float a,b,c,d,e,f,g,h;
} s_t;
  
void compute(s_t *a) {
  int i;
  for (i=4; i<SIZE; i++) {
    a[i].a=(a[i-1].b * 0.42 + a[i-3].d * .32);
    a[i].b = a[i].c * 3.56 - a[i-3].f;
    a[i].c = a[i].d + a[i].g*a[i-3].d;
    a[i].d = .3334/sqrt(a[i].f*a[i].f + a[i].c*a[i].c);
    if (a[i].f + a[i].a>1e-3) 
      a[i].f = (a[i].f - a[i].a)*(a[i].f + a[i].a);

  }
}

int main() {
  int i;
  s_t *a;
  a=(s_t *)malloc(sizeof(s_t)*SIZE);
  /* Initialization */
  for (i=0; i<SIZE; i++) 
    a[i].a=a[i].b=a[i].c=a[i].d=a[i].e=a[i].f=a[i].g=a[i].h=1./(i+1);
  /* Computation */
  for(i=0;i<100;i++) {
    compute(a);
    fprintf(stderr,".");
  }
  fprintf(stderr,"%f ",a[10].a);
  free(a);
  return 0;
}

I would like to use a "#pragma omp parallel for" on the loop in the compute function but there are several dependencies.

I tried with the depend clause, but I think that having a[i] depend on a[i-1] and a[i-3] will just make the code sequential. I don't really know how to handle this problem with OpenMP. Could you give me some ideas and/or guidance on how to do it?
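
To make the dependency problem concrete, a depend-clause version looks something like this (a simplified sketch; the compute_tasks name and the exact clauses are just for illustration). Each task has to wait for the tasks that produced a[i-1] and a[i-3], so everything ends up running one after the other:

void compute_tasks(s_t *a) {
  int i;
  #pragma omp parallel
  #pragma omp single
  for (i=4; i<SIZE; i++) {
    /* Task i waits for tasks i-1 and i-3, so tasks execute sequentially */
    #pragma omp task depend(in: a[i-1], a[i-3]) depend(inout: a[i]) firstprivate(i)
    {
      a[i].a = (a[i-1].b * 0.42 + a[i-3].d * .32);
      /* ... same body as in compute() above ... */
    }
  }
}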

I added the main so that you can see how the compute function is called. If you have any other ideas on how to optimize the code with OpenMP or any other method, please let me know.

  • How big is `SIZE`? – Laci Jan 11 '23 at 14:30
  • Really dependent on the use case whether this would be worth it, but you could have `a[i-3].d`, `a[i-3].f` and `a[i-1].b` as parts of the struct. For this single loop, it probably isn't worth the effort. – Jason Jan 11 '23 at 14:32
  • @Laci SIZE is 100000000. It is just a simple code with a huge amount of computation designed to be optimized with several methods. – Unemployed Goose Jan 11 '23 at 14:33
  • This is a recurrence, so the code as such is not simply parallelizable. You could try recursive doubling. Maybe if you take a step back and describe what you're actually trying to do? There may be a completely different way of phrasing this. – Victor Eijkhout Jan 11 '23 at 14:40
  • You may wish to use `sqrtf` and `float` constants (e.g. `0.42f`). – Laci Jan 11 '23 at 14:40
  • Note that 1/sqrt(x) can be computed significantly faster with a lower precision. That being said, the long chain of 100_000_000 operations will certainly result in a pretty big numerical error. Since this code is *inherently sequential*, you need to focus on the critical path to make it faster. More specifically, you certainly need to reduce the latency of the instructions on the critical path. – Jérôme Richard Jan 11 '23 at 14:58
  • Another observation: your code looks like a recurrence, but it is not if you look at the separate components. For instance, the first line of the body computes the `a[i].a` component from previous `i` values, but that `.a` component is not used elsewhere in the loop, so you could make a separate, completely parallel loop computing just the `a[i].a` values. (There is the matter of that `if` statement. I think that can be moved to a separate loop too.) But you need to work this out carefully. It's not simple. – Victor Eijkhout Jan 11 '23 at 14:59
  • @VictorEijkhout If one follows the dependencies, all components still have recurrences onto the same component. These recurrences happen at a distance of 3 or 4, so there might be some parallelism, but it certainly doesn't scale. – paleonix Jan 11 '23 at 17:43
  • Using `sqrtf`, `float` constants and the `-ffast-math` flag makes the code run about 5 times faster. Check it on [godbolt](https://godbolt.org/z/8qeEqEovY). – Laci Jan 11 '23 at 17:43
  • @Laci Note that Godbolt is not designed for benchmarks. In fact, timings are very affected by other people running compilations on the same machine. With your link, I got 1.07 s the first run, then 0.48 (by just changing a space), then 1.11 and then 0.51. A >2x slower execution of the same code is pretty huge when trying to do precise benchmarks ;). QuickBench is better for that (though not perfect). – Jérôme Richard Jan 11 '23 at 18:41

1 Answer


Optimizing this code turns out to be very difficult because of the dependencies: there is no really efficient way to parallelize it using multiple threads on modern mainstream processors. That being said, some operations are independent and perform the same work on multiple data, so we can benefit from instruction-level parallelism and SIMD instructions. Indeed, a modern processor core can execute multiple instructions simultaneously in a given cycle, and each instruction can operate on several values at the same time.

Additionally, we can benefit from the fast reciprocal square root instruction provided by mainstream x86-64 processors. In fact, this is largely what makes the -ffast-math flag faster, together with the use of FMA instructions to reduce the latency of the critical-path instructions. That being said, this instruction (as well as -ffast-math) may cause the result to differ from one machine to another and also to be less precise. On top of that, double-precision constants can be replaced by single-precision ones as proposed by @Laci (not to mention the use of sqrtf, though there is no double-precision reciprocal square-root instruction on x86-64 processors anyway).
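
As an aside, here is a minimal sketch (not taken from the code below) of what the reciprocal square root approximation looks like with SSE intrinsics: rsqrtps has a relative error of at most 1.5*2^-12, and one Newton-Raphson step can be added when more precision is needed:

#include <immintrin.h>

/* Approximate 1/sqrt(x) for 4 packed floats using the hardware rsqrtps
   instruction (relative error <= 1.5*2^-12). */
static inline __m128 rsqrt_approx(__m128 x) {
    return _mm_rsqrt_ps(x);
}

/* One Newton-Raphson refinement step: y' = y * (1.5 - 0.5 * x * y * y). */
static inline __m128 rsqrt_refined(__m128 x) {
    __m128 y = _mm_rsqrt_ps(x);
    __m128 t = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), x), _mm_mul_ps(y, y));
    return _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f), t));
}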


Implementation

To be able to use SIMD instructions, one needs to track the dependencies and find, in the C code, several instances of the same instruction working on different values. The first step is to unroll the loop 3 times because of the i-3 dependency. The second step consists in gathering the same operations while taking care of the dependencies. The result is a big, ugly code like this:

#include <math.h>

void compute_unroll(s_t* a)
{
    int i;

    for (i=4; i<SIZE-2; i+=3)
    {
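        // Values from previous iterations (already updated in memory)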
        float dm3 = a[i-3].d;
        float dm2 = a[i-2].d;
        float dm1 = a[i-1].d;

        float fm3 = a[i-3].f;
        float fm2 = a[i-2].f;
        float fm1 = a[i-1].f;

        float bm1 = a[i-1].b;

        float a0 = a[i].a;
        float b0 = a[i].b;
        float c0 = a[i].c;
        float d0 = a[i].d;
        float e0 = a[i].e;
        float f0 = a[i].f;
        float g0 = a[i].g;

        float a1 = a[i+1].a;
        float b1 = a[i+1].b;
        float c1 = a[i+1].c;
        float d1 = a[i+1].d;
        float e1 = a[i+1].e;
        float f1 = a[i+1].f;
        float g1 = a[i+1].g;

        float a2 = a[i+2].a;
        float b2 = a[i+2].b;
        float c2 = a[i+2].c;
        float d2 = a[i+2].d;
        float e2 = a[i+2].e;
        float f2 = a[i+2].f;
        float g2 = a[i+2].g;

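        // The three b values are computed first: a1 and a2 depend on b0 and b1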
        b0 = c0 * 3.56f - fm3;
        b1 = c1 * 3.56f - fm2;
        b2 = c2 * 3.56f - fm1;

        a0 = bm1 * 0.42f + dm3 * 0.32f;
        a1 = b0 * 0.42f + dm2 * 0.32f;
        a2 = b1 * 0.42f + dm1 * 0.32f;

        c0 = d0 + g0*dm3;
        c1 = d1 + g1*dm2;
        c2 = d2 + g2*dm1;

        d0 = 0.3334f / sqrtf(f0*f0 + c0*c0);
        d1 = 0.3334f / sqrtf(f1*f1 + c1*c1);
        d2 = 0.3334f / sqrtf(f2*f2 + c2*c2);

        if (f0 + a0 > 1e-3f) f0 = (f0 - a0) * (f0 + a0);
        if (f1 + a1 > 1e-3f) f1 = (f1 - a1) * (f1 + a1);
        if (f2 + a2 > 1e-3f) f2 = (f2 - a2) * (f2 + a2);

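        // Store the updated values back to memory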
        a[i].a = a0;
        a[i].b = b0;
        a[i].c = c0;
        a[i].d = d0;
        a[i].f = f0;

        a[i+1].a = a1;
        a[i+1].b = b1;
        a[i+1].c = c1;
        a[i+1].d = d1;
        a[i+1].f = f1;

        a[i+2].a = a2;
        a[i+2].b = b2;
        a[i+2].c = c2;
        a[i+2].d = d2;
        a[i+2].f = f2;
    }

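    // Remaining part (scalar)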
    for (; i<SIZE; ++i)
    {
        a[i].a = (a[i-1].b * 0.42f + a[i-3].d * 0.32f);
        a[i].b = a[i].c * 3.56f - a[i-3].f;
        a[i].c = a[i].d + a[i].g*a[i-3].d;
        a[i].d = 0.3334f / sqrtf(a[i].f*a[i].f + a[i].c*a[i].c);

        if (a[i].f + a[i].a > 1e-3f)
            a[i].f = (a[i].f - a[i].a) * (a[i].f + a[i].a);
    }
}

Now, we can see that some parts can easily benefit from SIMD instructions, like this one for example:

d0 = 0.3334f / sqrtf(f0*f0 + c0*c0);
d1 = 0.3334f / sqrtf(f1*f1 + c1*c1);
d2 = 0.3334f / sqrtf(f2*f2 + c2*c2);

Some other parts require data from multiple locations in memory. Gathering data from non-contiguous locations is generally not efficient. The problem is that the input layout is inefficient in the first place. It would be more efficient to store the data like a0 a1 a2 b0 b1 b2 c0 c1 c2 ... rather than a0 b0 c0 d0 e0 f0 g0 h0 a1 b1 c1 .... Indeed, this alternative layout enables the processor to load/store data block by block (SIMD loads/stores) rather than one value at a time (scalar loads). That being said, it may not be possible to change the data layout depending on the context in which this code is used. Because of that, the following part of this answer does not consider this optimization.
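
For illustration, such a layout could look like the following sketch (the s_block_t type and the BLOCK size are hypothetical and not used in the rest of this answer; a block of 4 matches the SSE register width):

#define BLOCK 4

/* "Array of structures of arrays": each field is stored as a small contiguous
   block, so a SIMD load/store can fetch 4 consecutive values of one field. */
typedef struct {
    float a[BLOCK], b[BLOCK], c[BLOCK], d[BLOCK];
    float e[BLOCK], f[BLOCK], g[BLOCK], h[BLOCK];
} s_block_t;

/* Element i of the original array maps to blocks[i / BLOCK].a[i % BLOCK], etc. */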

Now we can vectorize the code using SIMD instructions. There are many ways to do this: there are, for example, some relatively high-level SIMD libraries, and OpenMP can also help with its SIMD directives. However, the performance of these solutions tends to be quite disappointing for a pathological, unusual case like this one. Consequently, I chose to use low-level x86-64 SSE intrinsics. More specifically, I considered the SSE4.1 instruction set (supported by >99% of PCs) and the FMA instruction set (also widely supported; fmadd/fmsub instructions can easily be replaced by fmul+fadd/fsub instructions if needed).

#include <math.h>
#include <immintrin.h>

void compute_sse(s_t* a)
{
    int i = 4;

    if (i<SIZE-2)
    {
        __m128 vdm = _mm_setr_ps(a[i-3].d, a[i-2].d, a[i-1].d, 0.0f);
        __m128 vfm = _mm_setr_ps(a[i-3].f, a[i-2].f, a[i-1].f, 0.0f);
        float bm1 = a[i-1].b;

        for (; i<SIZE-2; i+=3)
        {
            float a0 = a[i].a, a1 = a[i+1].a, a2 = a[i+2].a;
            float b0 = a[i].b, b1 = a[i+1].b, b2 = a[i+2].b;
            float c0 = a[i].c, c1 = a[i+1].c, c2 = a[i+2].c;
            float d0 = a[i].d, d1 = a[i+1].d, d2 = a[i+2].d;
            float e0 = a[i].e, e1 = a[i+1].e, e2 = a[i+2].e;
            float f0 = a[i].f, f1 = a[i+1].f, f2 = a[i+2].f;
            float g0 = a[i].g, g1 = a[i+1].g, g2 = a[i+2].g;

            // vb[j] = vc[j] * 3.56f - vfm[j]
            __m128 vc = _mm_setr_ps(c0, c1, c2, 0.0f);
            __m128 vb = _mm_fmsub_ps(vc, _mm_set1_ps(3.56f), vfm);

            _MM_EXTRACT_FLOAT(b0, vb, 0);
            _MM_EXTRACT_FLOAT(b1, vb, 1);
            _MM_EXTRACT_FLOAT(b2, vb, 2);

            a[i].b = b0;
            a[i+1].b = b1;
            a[i+2].b = b2;

            // va[j] = vb_prev[j] * 0.42f + vdm[j] * 0.32f
            __m128 vb_prev = _mm_setr_ps(bm1, b0, b1, 0.0f);
            __m128 va = _mm_fmadd_ps(vb_prev, _mm_set1_ps(0.42f), _mm_mul_ps(vdm, _mm_set1_ps(0.32f)));

            // Store data to memory
            _MM_EXTRACT_FLOAT(a[i].a, va, 0);
            _MM_EXTRACT_FLOAT(a[i+1].a, va, 1);
            _MM_EXTRACT_FLOAT(a[i+2].a, va, 2);

            // vc[j] = vg[j] * vdm[j] + vd[j]
            __m128 vd = _mm_setr_ps(d0, d1, d2, 0.0f);
            __m128 vg = _mm_setr_ps(g0, g1, g2, 0.0f);
            vc = _mm_fmadd_ps(vg, vdm, vd);

            // Store data to memory
            _MM_EXTRACT_FLOAT(a[i].c, vc, 0);
            _MM_EXTRACT_FLOAT(a[i+1].c, vc, 1);
            _MM_EXTRACT_FLOAT(a[i+2].c, vc, 2);

            // d[j] = 0.3334f / sqrtf(vf[j]*vf[j] + vc[j]*vc[j])
            __m128 vf = _mm_setr_ps(f0, f1, f2, 0.0f);
            __m128 vf2 = _mm_mul_ps(vf, vf);
            __m128 vc2 = _mm_mul_ps(vc, vc);
            __m128 vsum = _mm_add_ps(vf2, vc2);
            vd = _mm_mul_ps(_mm_set1_ps(0.3334f), _mm_rsqrt_ps(vsum));

            // Store data to memory
            _MM_EXTRACT_FLOAT(a[i].d, vd, 0);
            _MM_EXTRACT_FLOAT(a[i+1].d, vd, 1);
            _MM_EXTRACT_FLOAT(a[i+2].d, vd, 2);

            // if(f[j] + a[j] > 1e-3f) f[j] = (f[j] - a[j]) * (f[j] + a[j]);
            __m128 vfpa = _mm_add_ps(vf, va);
            __m128 vcond = _mm_cmpgt_ps(vfpa, _mm_set1_ps(1e-3f));
            __m128 vfma = _mm_sub_ps(vf, va);
            vf = _mm_blendv_ps(vf, _mm_mul_ps(vfma, vfpa), vcond);

            // Store data to memory
            _MM_EXTRACT_FLOAT(a[i].f, vf, 0);
            _MM_EXTRACT_FLOAT(a[i+1].f, vf, 1);
            _MM_EXTRACT_FLOAT(a[i+2].f, vf, 2);

            // Useful for the next iteration not to reload values from memory
            vdm = vd;
            vfm = vf;
            bm1 = b2;
        }
    }

    // Remaining part
    for (; i<SIZE; ++i)
    {
        a[i].a = (a[i-1].b * 0.42f + a[i-3].d * 0.32f);
        a[i].b = a[i].c * 3.56f - a[i-3].f;
        a[i].c = a[i].d + a[i].g*a[i-3].d;
        a[i].d = 0.3334f / sqrtf(a[i].f*a[i].f + a[i].c*a[i].c);

        if (a[i].f + a[i].a > 1e-3f)
            a[i].f = (a[i].f - a[i].a) * (a[i].f + a[i].a);
    }
}

Note that this code tries to keep data in registers (fast) as much as possible rather than reloading it from memory (slow). Still, on my machine, this code spends a significant portion of its time reading/writing data from/to memory. This is mainly due to the inefficient memory layout, but also because it is hard for one core to fully saturate the memory.


Results

Here are experimental results on my i5-9600KF processor using GCC 10.3 with the flags -O3 -march=native -ffast-math:

compute (no -ffast-math):  0.444 ms/it
compute:                   0.404 ms/it
compute_laci:              0.318 ms/it
compute_unroll:            0.317 ms/it
compute_sse:               0.254 ms/it
seq optimal lower-bound:   0.190 ms/it  (saturation of the RAM by 1 core)
par optimal lower-bound:   0.170 ms/it  (saturation of the RAM)
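
For reference, compute_laci denotes the variant suggested by @Laci in the comments (float constants and sqrtf); presumably something like the following sketch:

/* Original compute() with single-precision constants and sqrtf (assumed). */
void compute_laci(s_t *a) {
    int i;
    for (i=4; i<SIZE; i++) {
        a[i].a = a[i-1].b * 0.42f + a[i-3].d * 0.32f;
        a[i].b = a[i].c * 3.56f - a[i-3].f;
        a[i].c = a[i].d + a[i].g * a[i-3].d;
        a[i].d = 0.3334f / sqrtf(a[i].f * a[i].f + a[i].c * a[i].c);
        if (a[i].f + a[i].a > 1e-3f)
            a[i].f = (a[i].f - a[i].a) * (a[i].f + a[i].a);
    }
}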

compute_sse is mostly memory-bound on my machine, as it reaches a good throughput of 23.5 GiB/s, while the maximum is generally 31-32 GiB/s (reads+writes) using 1 core, and never more than 34-36 GiB/s (reads+writes) using multiple cores in practice. A better memory layout should be enough to get an execution time very close to the optimal one using 1 core.

Note that server processors like Intel Xeons tend not to saturate the RAM bandwidth with few cores, due to the way the architecture is designed (they are meant to run parallel codes that scale). In fact, the practical RAM throughput a single core can achieve is generally significantly smaller than on a mainstream PC, so this code can be less efficient on such server processors. See this answer for more information.

Jérôme Richard