
I want to replace this code with SSE intrinsics:

for (i = 1; i <= imax; i++) {
    for (j = 1; j <= jmax; j++) {
        /* only cells whose C_F flag bit is set get a new rhs value */
        if (flag[i][j] & C_F) {
            float fcalc = f[i][j] - f[i-1][j];
            float a = fcalc / delx;
            float gcalc = g[i][j] - g[i][j-1];
            float b = gcalc / dely;
            float add = a + b;
            rhs[i][j] = add / del_t;
        }
    }
}

`flag` is a 2D array of characters and `C_F` is a constant:

#define C_F      0x0010

I am new to SSE. I know how basic branching works with floats, but I have no idea how it works with constants and characters. Any help will be greatly appreciated.

  • 2
    You can't easily do a per-element conditional store, but what you normally do instead is blend with the old value to store back what was originally there for elements where the condition is false. Like `rhs = condition? updated : rhs;`. You're definitely going to want to multiply by a loop-invariant `1.0f/delx` instead of actually dividing. – Peter Cordes Mar 19 '21 at 23:49
  • 2
    To actually use `flag[i][j + 0..3]` as a condition, you'd want to left-shift so that selected bit is at the top of each 32-bit element, then use SSE4.1 `_mm_blendv_ps` (or emulate that with AND/ANDN/OR after broadcasting the bit with an arithmetic right shift by 31) – Peter Cordes Mar 19 '21 at 23:51
  • 2
    `delx*del_t` and `dely*del_t` are invariants and should be strength reduced. Take the reciprocals of those and multiply them with a and b, keeping in mind that this will slightly change your floating-point answers, as FP math is not 100% commutative nor associative. This turns the loop into 2 subtracts, 2 muls, and a single add. The if part Peter has explained well :) – Michael Dorgan Mar 20 '21 at 00:38
  • 1
    Oh, and if `flag[][]` is a `char[][]` array, you'll want to load it with SSE4.1 `_mm_cvtepu8_epi32`, like in [Loading 8 chars from memory into an \_\_m256 variable as packed single precision floats](https://stackoverflow.com/q/34279513) but without the final convert. Also related: [is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/q/36488675) but you have bytes, not packed bits so it's much easier, just `pmovzxbd`. – Peter Cordes Mar 20 '21 at 01:10
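
Putting the comments above together, here is a minimal sketch of one row of the loop, not a tuned implementation. Several things are assumptions rather than facts from the question: the function name `rhs_row_sse2` is hypothetical, each row is taken to be a contiguous `float` array (and `flag[i]` a contiguous `char` array), and `rhs` is assumed to be initialized, because the blend reads the old values back. To stay within baseline SSE2 it widens the flag bytes with two unpacks against zero instead of SSE4.1 `_mm_cvtepu8_epi32`, builds the mask with an integer compare instead of a shift, and emulates `_mm_blendv_ps` with AND/ANDN/OR. Because the divisions are folded into precomputed reciprocals, results can differ from the original in the last bits.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

#define C_F 0x0010

/* Hypothetical helper: updates rhs_i[1..jmax] for one row i.
   f_im1 is row i-1 of f; all pointers address contiguous rows. */
void rhs_row_sse2(float *rhs_i, const float *f_i, const float *f_im1,
                  const float *g_i, const char *flag_i, int jmax,
                  float delx, float dely, float del_t)
{
    assert(jmax >= 0);

    /* Loop invariants: fold both divisions into one multiply each. */
    const float sx = 1.0f / (delx * del_t);
    const float sy = 1.0f / (dely * del_t);
    const __m128 vsx = _mm_set1_ps(sx);
    const __m128 vsy = _mm_set1_ps(sy);
    const __m128i vcf = _mm_set1_epi32(C_F);
    const __m128i zero = _mm_setzero_si128();

    int j = 1;
    for (; j + 3 <= jmax; j += 4) {
        /* a = (f[i][j] - f[i-1][j]) * sx for 4 columns */
        __m128 a = _mm_mul_ps(
            _mm_sub_ps(_mm_loadu_ps(f_i + j), _mm_loadu_ps(f_im1 + j)), vsx);
        /* b = (g[i][j] - g[i][j-1]) * sy, using an unaligned load at j-1 */
        __m128 b = _mm_mul_ps(
            _mm_sub_ps(_mm_loadu_ps(g_i + j), _mm_loadu_ps(g_i + j - 1)), vsy);
        __m128 updated = _mm_add_ps(a, b);

        /* Load 4 flag bytes and zero-extend each to a 32-bit lane. */
        int32_t packed;
        memcpy(&packed, flag_i + j, sizeof packed);
        __m128i fl = _mm_cvtsi32_si128((int)packed);
        fl = _mm_unpacklo_epi8(fl, zero);
        fl = _mm_unpacklo_epi16(fl, zero);

        /* Lane is all-ones where (flag & C_F) != 0, else all-zeros. */
        __m128 mask = _mm_castsi128_ps(
            _mm_cmpeq_epi32(_mm_and_si128(fl, vcf), vcf));

        /* rhs = mask ? updated : old, done with AND/ANDN/OR. */
        __m128 old = _mm_loadu_ps(rhs_i + j);
        __m128 res = _mm_or_ps(_mm_and_ps(mask, updated),
                               _mm_andnot_ps(mask, old));
        _mm_storeu_ps(rhs_i + j, res);
    }

    /* Scalar tail for the last 0..3 columns. */
    for (; j <= jmax; j++) {
        if (flag_i[j] & C_F)
            rhs_i[j] = (f_i[j] - f_im1[j]) * sx + (g_i[j] - g_i[j - 1]) * sy;
    }
}
```

Under those assumptions, the outer loop would call it once per row, for example `rhs_row_sse2(rhs[i], f[i], f[i-1], g[i], flag[i], jmax, delx, dely, del_t);` for `i` from 1 to `imax`. With SSE4.1 available, the two unpacks and the AND/ANDN/OR sequence could be swapped for `_mm_cvtepu8_epi32` and `_mm_blendv_ps` as suggested in the comments.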
