I keep getting segment fault when using SIMD instructions to optimize matrix multiplication.
Here is the core computing part.
The matrices are stored like this:
a large vector<double> buf
of size (3 * 1025 * 1025) is allocated. matrix A starts from buf[0]
, matrix B starts from buf[1025]
and C starts from buf[1025*2]
. I performed various matrix multiplication with size of 4 to 1024. So they could all fit in this vector.
#include <immintrin.h>
#define BLOCK_SIZE 4
/*
* performs 4 * 4 matrix multiplication C=A*B
* C is 4-by-4, A is 4-by-4, and B is 4-by-4, column major matrices
* lda is the size of the large matrix.
*/
static void do_block(int lda4, double* A, double* B, double* C) {
int n=4;
for(int i=0; i<n; i++){ // process i th column
for(int j=0; j<n; j++){
__m256d c = _mm256_load_pd(C+j*lda);
c = _mm256_fmadd_pd(_mm256_load_pd(A+i*lda), _mm256_broadcast_sd(B+i+j*lda), c);
_mm256_store_pd(C+j*lda, c);
}
}
}
/* This routine performs a dgemm operation
* C := C + A * B
* where A, B, and C are lda-by-lda matrices stored in column-major format.
* On exit, A and B maintain their input values. */
void square_dgemm(int lda, double* A, double* B, double* C) {
for (int j = 0; j < lda; j += BLOCK_SIZE) {
// Accumulate block dgemms into block of C
for (int k = 0; k < lda; k += BLOCK_SIZE) {
// For each block-row of A
for (int i = 0; i < lda; i += BLOCK_SIZE) {
do_block(lda, A + i + k * lda, B + k + j * lda, C + i + j * lda);
}
}
}
}
The weird thing is: When I change the vector from size of (3 * 1025 * 1025) to (3 * 1024 * 1024), it gives me segment fault.
My questions are:
- I have learnt that these instructions require aligned data. Indeed replacing by unaligned versions like
_mm256_loadu_pd
eliminates this error. However, since size of(3 * 1024 * 1024 * sizeof(double)) % 32 bytes == 0
, isn't it 32 bytes-aligned or I misunderstood the concept? - I have allocated very large contiguous space, and why it crashes from the beginning, when performing small mat mul (4*4)? I thought that as long as I calling
_mm256_load_pd(addr)
with at least 32 bytes allocated starting fromaddr
, it won't crash, am I wrong? - Why it doesn't crash on
buf
of (3 * 1025 * 1025), but crashes on (3 * 1024 * 1024) ? It seems doesn't crashes when the size is an odd number, like 1025, 1027, 1029 and always crash when the number is even, like 1024, 1026.
The code was compiled using GCC, with -march=native and -O3. The CPU supports FMA, AVX and AVX2. The machine is Google Cloud VM, the CPU is Intel Xeon which I cannot get the exact model. Thanks for your advice!