2

I keep getting segment fault when using SIMD instructions to optimize matrix multiplication.

Here is the core computing part. The matrices are stored like this: a large vector<double> buf of size (3 * 1025 * 1025) is allocated. matrix A starts from buf[0], matrix B starts from buf[1025] and C starts from buf[1025*2]. I performed various matrix multiplication with size of 4 to 1024. So they could all fit in this vector.

#include <immintrin.h>
#define BLOCK_SIZE 4
/*
 * performs 4 * 4 matrix multiplication C=A*B
 * C is 4-by-4, A is 4-by-4, and B is 4-by-4, column major matrices
 * lda is the size of the large matrix.
 */
static void do_block(int lda4, double* A, double* B, double* C) {
    int n=4;
    for(int i=0; i<n; i++){ // process i th column
      for(int j=0; j<n; j++){
        __m256d c = _mm256_load_pd(C+j*lda);
        c = _mm256_fmadd_pd(_mm256_load_pd(A+i*lda), _mm256_broadcast_sd(B+i+j*lda), c);
        _mm256_store_pd(C+j*lda, c);
      }
    }
}

/* This routine performs a dgemm operation
 *  C := C + A * B
 * where A, B, and C are lda-by-lda matrices stored in column-major format.
 * On exit, A and B maintain their input values. */
void square_dgemm(int lda, double* A, double* B, double* C) {
    for (int j = 0; j < lda; j += BLOCK_SIZE) {
        // Accumulate block dgemms into block of C
        for (int k = 0; k < lda; k += BLOCK_SIZE) {
            // For each block-row of A
            for (int i = 0; i < lda; i += BLOCK_SIZE) {
                do_block(lda, A + i + k * lda, B + k + j * lda, C + i + j * lda);
            }
        }
    }
}

The weird thing is: When I change the vector from size of (3 * 1025 * 1025) to (3 * 1024 * 1024), it gives me segment fault.

My questions are:

  1. I have learnt that these instructions require aligned data. Indeed replacing by unaligned versions like _mm256_loadu_pd eliminates this error. However, since size of (3 * 1024 * 1024 * sizeof(double)) % 32 bytes == 0, isn't it 32 bytes-aligned or I misunderstood the concept?
  2. I have allocated very large contiguous space, and why it crashes from the beginning, when performing small mat mul (4*4)? I thought that as long as I calling _mm256_load_pd(addr) with at least 32 bytes allocated starting from addr, it won't crash, am I wrong?
  3. Why it doesn't crash on buf of (3 * 1025 * 1025), but crashes on (3 * 1024 * 1024) ? It seems doesn't crashes when the size is an odd number, like 1025, 1027, 1029 and always crash when the number is even, like 1024, 1026.

The code was compiled using GCC, with -march=native and -O3. The CPU supports FMA, AVX and AVX2. The machine is Google Cloud VM, the CPU is Intel Xeon which I cannot get the exact model. Thanks for your advice!

  • 1
    If you lookup `sigaction` (there are probably examples on SO) you can install a handler for SIGSEGV that is provided a `siginfo_t` which describes the cause of the fault in detail. You may need to become a bit familiar with `/usr/include//asm/siginfo.h` to parse the info fields. – mevets Dec 04 '21 at 15:46
  • 1
    Or more simply, run the program in your debugger, which will provide all this information, correlate it to line numbers, and provide a convenient interface for inspecting the entire state of your program at the crash. – Nate Eldredge Dec 04 '21 at 16:32
  • 2
    I don't think `vector` guarantees anything beyond 8-byte alignment, so the pointers `A,B,C` coming in may not reliably have proper alignment to begin. Try printing out those pointer values on entry. Anyway, `buf[0]` and `buf[1025]` can't both be aligned to 32 bytes? – Nate Eldredge Dec 04 '21 at 16:40
  • 1
    related: [How to solve the 32-byte-alignment issue for AVX load/store operations?](https://stackoverflow.com/q/32612190) - some of your loads may need to use `_mm256_loadu_ps` if you don't pad the row stride to a multiple of 4 doubles (32 bytes) separately from the actually matrix geometry. – Peter Cordes Dec 04 '21 at 21:35
  • @PeterCordes - I was going to update my answer to pad row strides, but now I can't undelete it (needs 20 votes?), and it seems your link covers the issue. Also in your edit, you mentioned the original pointer was lost, but that is *M*, and it is never modified (other than malloc and free). – rcgldr Dec 05 '21 at 18:05
  • @rcgldr: When it says "20 votes remaining", that's how many delete/undelete votes *you* have left for the day, not how many the answer needs. Not sure why it wouldn't let you undelete your answer unilaterally with 1 vote, though; did you try? Even if not, it should take at most 3 votes; I can cast one. Re: losing the original, ah yes, I see you're doing one big allocation and chopping it up; that's fine as long as your de-allocation knows about that. I had been just going to comment, and it was getting late so I didn't read super carefully. – Peter Cordes Dec 05 '21 at 20:20
  • @rcgldr: Anyway I'd still highly recommend using C11 `aligned_alloc` or `posix_memalign` as in [How to solve the 32-byte-alignment issue for AVX load/store operations?](https://stackoverflow.com/q/32612190), not over-allocating and doing pointer math. (BTW, `uintptr_t` is the most appropriate type for pointer shenanigans; that might happen to still work on exotic C implementations where the max size of a single object is a lot less than the address-space / pointer width, e.g. a segmented model. AVX won't work in x86-16 real mode, so probably not a real concern *here*.) – Peter Cordes Dec 05 '21 at 20:24
  • @PeterCordes - Visual Studio support for aligned_alloc goes back to VS2015. However, for those rare cases running Windows XP, VS2010 is the last version to run on Win XP. I have a multi-boot system (XP Pro, XP Pro X64, Win 7 Pro 64 bit, Win 10 Pro 64 bit), and occasionally I use Win XP, mosty for older games, and some video | DVD programs. – rcgldr Dec 06 '21 at 01:49
  • @rcgldr: This question only mentions Google Cloud VM, which doesn't run Windows, so no need to avoid ISO C standard features for compat with crufty old Windows. – Peter Cordes Dec 06 '21 at 02:17
  • @rcgldr: if you can't get your answer undeleted, you could always repost a new version of it. – Peter Cordes Dec 07 '21 at 04:55
  • 1
    @PeterCordes - since the question doesn't need to worry about old compilers, an updated version of my answer is not needed. – rcgldr Dec 07 '21 at 10:34

1 Answers1

1

Thanks for the comments and I think I have figured out what was wrong.

Answer to Q1: Yes, I misunderstood the concept of 'Alignment'. I have to ensure that the adress is the multiple of 32. So after manually (not an elegant way by doing pointer math) making 3 matrices starting from 36 bytes aligned space, it works.

Answer to Q2: Since the adress passed to _mm256_load_pd(addr) is not 32 bytes aligned, even if it's very large, an exception might be raised, according to Intel documentation

Answer to Q3: This is quite tricky: Not both of A and C are 32 bytes aligned. However, changing the total size will cause C to be 32 bytes aligned (But A still not aligned), and it doesn't crash. This works just by accident and it's not safe, as A is not aligned.

So vectorization of matrix multiplication is trciky, I need to

  1. Ensure matrix A and C to be 32 bytes aligned, maybe use something like aligned_alloc mentioned in the comments.
  2. pad the matrices into multiple of 4, to ensure in every step calling _mm256_load_pd(addr) the adress is properly aligned.