How to convert this C code into Assembly using intrinsics?

Question

this is my first time using intrisics and I have to convert the C code below into C code that uses intrisics for assembly. I don't know where to start.

void slow_routine(float alpha, float beta){

unsigned int i,j;

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      A[i][j] = A[i][j] + u1[i] * v1[j] + u2[i] * v2[j];


  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      x[i] = x[i] + beta * A[j][i] * y[j];

  for (i = 0; i < N; i++)
    x[i] = x[i] + z[i];


  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      w[i] = w[i] +  alpha * A[i][j] * x[j];

}

Any of the following can be used: x86-64 SSE/SSE2/SSE4/AVX/AVX2

Please help I have been trying for hours and I'm very stuck.

Are you talking about SIMD intrinsics? - have you looked at which ones exist? — 500 - Internal Server Error, Jan 24 '23 at 02:26
Yes I am referring to that but I have fund it massively confusing and I need help getting started — jebateapie123, Jan 24 '23 at 03:35
The first looks very easy for a compiler to auto-vectorize; with the right compile options (like `-O3 -march=x86-64-v3`) there's probably nothing to gain from intrinsics. The 2nd is trickier, you'd want to vectorize over the outer loop as well as inner since the inner loop is striding down columns of `A[][]`. The 4th has a reduction (specifically a dot product) in the inner loop. The 3rd loop can get folded into the outer loop of the 2nd, unless it ends up being better to aggressively transform the loop, like loop interchange (https://en.wikipedia.org/wiki/Loop_interchange) — Peter Cordes, Jan 24 '23 at 05:06
For the dot product, unrolling with multiple accumulators is important, as well as using SIMD vectors: [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) — Peter Cordes, Jan 24 '23 at 05:07
How big are the arrays? If they don't fit in L2 cache or preferably L1d, cache-blocking is also very important. — Peter Cordes, Jan 24 '23 at 05:09
This is how the arrays are defined: __declspec(align(64)) float A[N][N], u1[N], u2[N], v1[N], v2[N], x[N], y[N], w[N], z[N], test[N]; — jebateapie123, Jan 24 '23 at 05:15
Is N a constant at compile time? Can it be a template argument? — Something Something, Jan 24 '23 at 05:18
Note that this question basically asks us to implement significant parts of level 1 and level 2 BLAS. Doing this well is a bit involved, though a simple implementation is likely easy to come up with. — fuz, Jan 24 '23 at 10:00
@Nole Ksum N is constant it is defined at the top and its value is 8192 — jebateapie123, Jan 24 '23 at 11:08
It's 30 years since I did that in physics at Uni but this code looks to me like quantum mechanics; in Dirac notation, `A = A + + `, `|x> = |x> + b * A|y>`, ... or something close to that. It's usually possible to use platform-optimized linear algebra ops for this (intel mkl & co), but the translation to those is easier to write (for me) if I see the original equations... — FrankH., Mar 16 '23 at 08:01

How to convert this C code into Assembly using intrinsics?

0 Answers0