SSE for 2D arrays

Question

I want to change the following code using SSE3 instructions:

 for (i=0; i<=imax+1; i++) {
        /* The vertical velocity approaches 0 at the north and south
         * boundaries, but fluid flows freely in the horizontal direction */
        v[i][jmax] = 0.0;
        u[i][jmax+1] = u[i][jmax];

        v[i][0] = 0.0;
        u[i][0] = u[i][1];
    }

u and v are 2D arrays of type float. What I have so far is this but the program does not run correctly.

    int loop2 = ((imax+1) / loopFactor) * loopFactor;
    for(i=0; i<loop2; i+=loopFactor) {
        
        __m128 zeroVec = _mm_set1_ps(0.0f);
        _mm_storeu_ps(&v[i][jmax], zeroVec);
        __m128 umaxVec = _mm_loadu_ps(&u[i][jmax]);
        _mm_storeu_ps(&u[i][jmax+1], umaxVec);

        __m128 zVec = _mm_set1_ps(0.0f);
        _mm_storeu_ps(&v[i][0], zVec);
        __m128 uVec = _mm_loadu_ps(&u[i][1]);
        _mm_storeu_ps(&u[i][0], uVec);
    }
    for (; i<=imax+1; i++){
        v[i][jmax] = 0.0;
        u[i][jmax+1] = u[i][jmax];

        v[i][0] = 0.0;
        u[i][0] = u[i][1];
    }

I suspect that this is because _mm_loadu_ps stores values for u[i][1], u[i][2], u[i][3] and u[i][4] but I want to store the values u[i][1], u[i+1][1], u[i+2][1], u[i+3][1] and u[i+4][1]. How can I do that? Loopfactor has a value of 4. Any help is really appreciated.

Sse is quite inefficient when operating on neighbouring values. Is it possible to operate on transposed matrix? — tstanisl, Mar 08 '21 at 20:20
Your problem is a very bad candidate for sse optimizations. It is not cache efficient and it will run slowly..period. playing with sse will not help. If you were able to modify to the problem to make i iterate over the inner dimension (use 'u[x][i]')then the compiler would likely use sse automaticaly. — tstanisl, Mar 08 '21 at 21:48
Last time you asked, you has stuff like `float **matrix;` not a `float (*matrix)[n]` pointer to a 2D array. ([How to cast simple pointer to a multidimensional-array of fixed size?](https://stackoverflow.com/q/11869056)). Not that it makes a big difference for how you'd vectorize, since either way SIMD only lets you operate on contiguous data. It just makes it more efficient to work with multiple rows at once if the offset between them is fixed. — Peter Cordes, Mar 09 '21 at 01:45
Looping over the first dimension is terrible for C arrays (which are row major), although if you work on a whole cache line of data in each row of the column you're looping down, it's slightly less bad. — Peter Cordes, Mar 09 '21 at 01:48
AVX512 introduces [scatter instructions](https://www.felixcloutier.com/x86/vpscatterdd:vpscatterdq:vpscatterqd:vpscatterqq) which could do that. But as others said, it would be way more efficient if you could change your data layout to allow contiguous stores. — chtz, Mar 09 '21 at 16:26

SSE for 2D arrays

0 Answers0