
I'm currently trying to optimize matrix operations with intrinsics and loop unrolling. I'm getting a segmentation fault that I can't figure out. Here is the code I changed:

const int UNROLL = 4;

void outer_product(matrix *vec1, matrix *vec2, matrix *dst) {
    assert(vec1->dim.cols == 1 && vec2->dim.cols == 1 && vec1->dim.rows == dst->dim.rows && vec2->dim.rows == dst->dim.cols);
    __m256 tmp[4];
    for (int x = 0; x < UNROLL; x++) {
        tmp[x] = _mm256_setzero_ps();
    } 
    for (int i = 0; i < vec1->dim.rows; i+=UNROLL*8) {
        for (int j = 0; j < vec2->dim.rows; j++) {     
            __m256 row2 = _mm256_broadcast_ss(&vec2->data[j][0]);
            for (int x = 0; x<UNROLL; x++) {
                tmp[x] = _mm256_mul_ps(_mm256_load_ps(&vec1->data[i+x*8][0]), row2); 
                _mm256_store_ps(&dst->data[i+x*8][j], tmp[x]);
            } 
        }
    }
}

void matrix_multiply(matrix *mat1, matrix *mat2, matrix *dst) {
    assert (mat1->dim.cols == mat2->dim.rows && dst->dim.rows == mat1->dim.rows && dst->dim.cols == mat2->dim.cols);
    for (int i = 0; i < mat1->dim.rows; i+=UNROLL*8) {
        for (int j = 0; j < mat2->dim.cols; j++) {
            __m256 tmp[4];
            for (int x = 0; x < UNROLL; x++) {
                tmp[x] = _mm256_setzero_ps();
            } 
            for (int k = 0; k < mat1->dim.cols; k++) {
                __m256 mat2_s = _mm256_broadcast_ss(&mat2->data[k][j]);
                for (int x = 0; x < UNROLL; x++) {
                    tmp[x] = _mm256_add_ps(tmp[x], _mm256_mul_ps(_mm256_load_ps(&mat1->data[i+x*8][k]), mat2_s));
                }
            }
            for (int x = 0; x < UNROLL; x++) {
                _mm256_store_ps(&dst->data[i+x*8][j], tmp[x]);
            }    
        }
    }    
}

edited: Here is the matrix struct. I didn't modify it.

typedef struct shape {
    int rows;
    int cols;
} shape;

typedef struct matrix {
    shape dim;
    float** data;
} matrix;

edited: I used gdb to figure out which line caused the segmentation fault, and it looks like it was _mm256_load_ps(). Am I indexing into the matrix the wrong way, so that it can't load from the correct address? Or is it a problem of memory alignment?

Aria Lin
  • Can you show the definition of `matrix`? – SegFault Nov 20 '17 at 22:49
  • What is the purpose of providing 2-copies of the same `outer_product`? – David C. Rankin Nov 20 '17 at 22:53
  • Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: How to create a [mcve]. – Stargateur Nov 20 '17 at 22:57
  • @PhotometricStereo added – Aria Lin Nov 20 '17 at 22:59
  • @DavidC.Rankin Oh. That was a mistake. edited – Aria Lin Nov 20 '17 at 22:59
  • Is your data 32B-aligned? You're using aligned loads/stores (`_mm256_store_ps`). Use a debugger to see what asm instruction actually faults, and what address it's storing into. – Peter Cordes Nov 21 '17 at 01:29
  • Also, your code is very inefficient because your inner loop needs two levels of indirection, and you're not looping over contiguous memory (each iteration of the inner loop touches a 32B chunk of a different row.) If you look at the asm output, gcc is redoing the same 2 pointer loads, then 2 dependent loads with a different offset, for every `vmulps` / `vmovaps`. clang decides that the vector stores can't alias the pointer data, I guess, but it still has to load two pointers for every `vmulps` / `vmovaps`, which is pretty bad. https://godbolt.org/g/Am8PyQ. – Peter Cordes Nov 21 '17 at 01:35
  • Anyway, use 2D array indexing like `foo[row * col_dim + col]` to emulate a 2D array. Also, prefer using unsigned dimensions; that will usually compile to better asm. And loop over your arrays in contiguous order. For matmul, that means transposing one ahead of time. See also [how to optimize matrix multiplication (matmul) code to run fast on a single processor core](https://stackoverflow.com/questions/35401524/how-to-optimize-matrix-multiplication-matmul-code-to-run-fast-on-a-single-proc) for an example of 2D indexing (but the code in that question is slow). – Peter Cordes Nov 21 '17 at 01:39
  • But more importantly, **see [How does BLAS get such extreme performance?](https://stackoverflow.com/questions/1303182/how-does-blas-get-such-extreme-performance)**. Related, see my comments on [Poor maths performance in C vs Python/numpy](https://stackoverflow.com/questions/46552143/poor-maths-performance-in-c-vs-python-numpy) for more about bad memory access patterns. – Peter Cordes Nov 21 '17 at 01:43
  • @PeterCordes Thank you so much for your help! About the matrix indexing, I use mat->data[i][j] because the naive version (the starter code) indexes the matrix that way and it works fine. I thought that looked clearer, so I didn't change it. Does it make any difference when applied to _mm256_load? – Aria Lin Nov 21 '17 at 02:43
  • No, it doesn't directly affect `_mm256_load`, as long as your row pointers point to 32-byte aligned memory. Using `float *data` pointing to one large buffer and doing 2D indexing will still remove that level of indirection, though. – Peter Cordes Nov 21 '17 at 02:46
  • @PeterCordes And I'm trying to optimize the code step by step. First I just want to add SIMD and loop unrolling to get some speedup, but I got stuck at the beginning. Once that works, I'll do further optimization like parallel threads and cache blocking. – Aria Lin Nov 21 '17 at 02:56
  • I'd recommend you get your data layout correct *first*, before you spend any time on manual vectorization. e.g. transpose one input first so you're looping over contiguous memory for both the rows of one matrix and the columns of the other. Then the compiler should be able to auto-vectorize to some degree. Then worry about cache-blocking for an optimal access pattern. *Then* manually vectorize if the compiler's auto-vectorized asm output isn't optimal. (Or better, call an optimized BLAS library function that already has all these optimizations.) – Peter Cordes Nov 21 '17 at 03:02
  • Any time you spend tuning your SIMD code for optimal performance (like an unroll factor) with a terrible access pattern is mostly wasted, because the tuning parameters will be different when you change the data layout. Also, changing the data layout means you'll also need to rewrite two functions, instead of just the scalar reference implementation. – Peter Cordes Nov 21 '17 at 03:05

1 Answer


In at least one place, you're doing 32-byte alignment-required loads with a stride of only 4 bytes. I think that's not what you actually meant to do, though:

for (int k = 0; k < mat1->dim.cols; k++) {
    for (int x = 0; x < UNROLL; x++) {
        ...
        _mm256_load_ps(&mat1->data[i+x*8][k])
    }
}

_mm256_load_ps loads 8 contiguous floats, i.e. it loads data[i+x*8][k] to data[i+x*8][k+7]. I think you want data[i+x][k*8], and loop over k in the inner-most loop.
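
To see the semantics concretely, here's a tiny standalone demo (mine, not code from the question) of what one aligned load does:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    _Alignas(32) float row[8] = {0, 1, 2, 3, 4, 5, 6, 7};  /* one 32B-aligned row */
    __m256 v = _mm256_load_ps(&row[0]);   /* loads row[0]..row[7]: 8 contiguous floats */
    _Alignas(32) float out[8];
    _mm256_store_ps(out, v);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", out[i]);          /* prints: 0 1 2 3 4 5 6 7 */
    putchar('\n');
    return 0;
}

It never gathers one float from each of 8 different rows; the 8 floats always come from one contiguous 32-byte chunk.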

If you need unaligned loads / stores, use _mm256_loadu_ps / _mm256_storeu_ps. But prefer aligning your data to 32B, and pad the storage layout of your matrix so the row stride is a multiple of 32 bytes. (The actual logical dimensions of the array don't have to match the stride; it's fine to leave padding at the end of each row out to a multiple of 16 or 32 bytes. This makes loops much easier to write.)
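
For example, here's one way you might set up such a layout (a minimal C11 sketch; padded_matrix and make_padded are names I'm inventing, not part of the question's codebase):

#include <stdlib.h>

typedef struct padded_matrix {
    int rows, cols;
    int stride;    /* floats per row; >= cols, rounded up to a multiple of 8 */
    float *data;   /* element (r,c) lives at data[r * stride + c] */
} padded_matrix;

static int make_padded(padded_matrix *m, int rows, int cols) {
    m->rows = rows;
    m->cols = cols;
    m->stride = (cols + 7) & ~7;   /* pad row length to a multiple of 8 floats = 32B */
    /* aligned_alloc requires size to be a multiple of the alignment; it is,
     * because stride * sizeof(float) is a multiple of 32. */
    m->data = aligned_alloc(32, (size_t)rows * m->stride * sizeof(float));
    return m->data != NULL;
}

Every row then starts on a 32-byte boundary, so _mm256_load_ps on any row at a column that's a multiple of 8 is safe.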


You're not even using a 2D array (you're using an array of pointers to arrays of float), but the syntax looks the same as for float A[100][100], even though the meaning in asm is very different. Anyway, in Fortran 2D arrays the indexing goes the other way, where incrementing the left-most index takes you to the next position in memory. But in C, varying the left index by one takes you to a whole new row. (Pointed to by a different element of float **data, or in a proper 2D array, one row stride away.) Of course you're striding by 8 rows because of this mixup combined with using x*8.
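
A quick standalone illustration of the difference (my example, not the question's code):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* True 2D array: one contiguous block; the next row starts exactly
     * one row stride (4 floats = 16 bytes here) after the previous one. */
    float A[4][4];
    printf("A[0][0] at %p, A[1][0] at %p (exactly 16 bytes apart)\n",
           (void *)&A[0][0], (void *)&A[1][0]);

    /* Array of row pointers: each row is its own allocation, so rows
     * i and i+1 can be anywhere relative to each other in memory. */
    float *rows[4];
    for (int i = 0; i < 4; i++)
        rows[i] = malloc(4 * sizeof(float));
    printf("row 0 at %p, row 1 at %p (unrelated addresses)\n",
           (void *)rows[0], (void *)rows[1]);
    for (int i = 0; i < 4; i++)
        free(rows[i]);
    return 0;
}

So with float **data, varying the left index chases a different pointer; it never gives you the next contiguous chunk of floats.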

Speaking of the asm, you get really bad results for this code, especially with gcc, where it reloads 4 things for every vector, I think because it can't be sure the vector stores don't alias the pointer data. Assign things to local variables to make sure the compiler can hoist them out of loops (e.g. float **mat1dat = mat1->data;). Clang does slightly better, but the access pattern in the source is inherently bad and requires pointer-chasing for each inner-loop iteration to get to a new row, because you loop over x instead of k. I put it up on the Godbolt compiler explorer.
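
The hoisting itself looks like this (a sketch against the question's struct; scale_matrix is a made-up example function, kept scalar to keep it short):

typedef struct shape { int rows; int cols; } shape;
typedef struct matrix { shape dim; float **data; } matrix;

void scale_matrix(matrix *mat, float s) {
    float *const *rows = mat->data;   /* hoisted: loaded once, stays in a register;
                                       * the compiler no longer has to reload
                                       * mat->data after every store */
    const int nr = mat->dim.rows, nc = mat->dim.cols;
    for (int i = 0; i < nr; i++) {
        float *row = rows[i];         /* hoisted once per row, not once per element */
        for (int j = 0; j < nc; j++)
            row[j] *= s;              /* stores can't alias the locals above */
    }
}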


But really you should optimize the memory layout first, before trying to manually vectorize it. It might be worth transposing one of the arrays, so you can loop over contiguous memory for rows of one matrix and columns of the other while doing the dot product of a row and column to calculate one element of the result. Or it could be worth doing c[Arow][Bcol] += a[Arow][k] * b[k][Bcol] with Bcol as the inner loop, instead of transposing up front (but that's a lot of memory traffic). But whatever you do, make sure you're not striding through non-contiguous accesses to one of your matrices in the inner loop.
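
In scalar C, that second option looks like the following sketch (using the question's struct; matmul_ikj is my name for it). The inner j loop walks one row of b and one row of c contiguously, with no column-strided access anywhere:

typedef struct shape { int rows; int cols; } shape;
typedef struct matrix { shape dim; float **data; } matrix;

void matmul_ikj(const matrix *mat1, const matrix *mat2, matrix *dst) {
    const int M = mat1->dim.rows, K = mat1->dim.cols, N = mat2->dim.cols;
    for (int i = 0; i < M; i++) {
        float *drow = dst->data[i];
        for (int j = 0; j < N; j++)
            drow[j] = 0.0f;                      /* zero the output row first */
        for (int k = 0; k < K; k++) {
            const float a = mat1->data[i][k];    /* one scalar from A, reused N times */
            const float *brow = mat2->data[k];   /* contiguous row of B */
            for (int j = 0; j < N; j++)
                drow[j] += a * brow[j];          /* c[i][j] += a[i][k] * b[k][j] */
        }
    }
}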

You'll also want to ditch the array-of-pointers thing and do manual 2D indexing (data[row * row_stride + col]) so your data is all in one contiguous block instead of having each row allocated separately. Making this change first, before you spend any time manually vectorizing, seems to make the most sense.
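
Combined with the padded allocation sketched earlier, the indexing helpers might look like this (again my names, not an existing API):

#include <immintrin.h>
#include <stddef.h>

typedef struct padded_matrix {
    int rows, cols, stride;   /* stride: floats per row, a multiple of 8 */
    float *data;              /* one contiguous 32-byte-aligned block    */
} padded_matrix;

static inline float *at(padded_matrix *m, int r, int c) {
    return &m->data[(size_t)r * m->stride + c];   /* flat 2D indexing */
}

/* An aligned 8-float load is legal whenever c is a multiple of 8,
 * because every padded row starts on a 32-byte boundary. */
static inline __m256 load8(padded_matrix *m, int r, int c) {
    return _mm256_load_ps(at(m, r, c));
}

One level of indirection disappears entirely: the compiler computes the address with a multiply-add instead of first loading a row pointer from memory.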

gcc or clang with -O3 should do a not-terrible job of auto-vectorizing scalar C, especially if you compile with -ffast-math. (You might remove -ffast-math after you're done manually vectorizing, but use it while tuning with auto-vectorization).
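
As a concrete example of why -ffast-math matters (a standalone sketch, not from the question): a float reduction like this only auto-vectorizes when the compiler is allowed to reassociate the additions:

/* Compile with: gcc -O3 -march=native -ffast-math -S dot.c
 * and the loop becomes packed vmulps/vaddps (or vfmadd) over 8 floats
 * at a time; without -ffast-math it stays a serial scalar add chain,
 * because FP addition isn't associative. */
float dotf(const float *restrict a, const float *restrict b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}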

Related:

You might manually vectorize before or after looking at cache-blocking, but when you do, see Matrix Multiplication with blocks.

Peter Cordes