1

I want to multiply two NxN matrices using SIMD. I want to do matrix multiplication for 64-bit integers, and multiply one element of a matrix with another element with the same index. For example:

c[1][1] = a[1][1] * b[1][1]

An error occurs when multiplying with the _mm256_mullo_epi64 operation. I can't figure out why this is happening. Can I write the resulting value into a 256-bit register?

#include <iostream>
#include <immintrin.h>

using namespace std;

int avx_mult(__int64** A, __int64** B, __int64** C, int N) {
    cout << "AVX mult:" << endl;
    if (N < 4 || N % 4 != 0) return 0;

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j += 4) {
            // filling the resulting AMX register with zeros
            __m256i c_line = _mm256_setzero_si256();
            // load 4 long long elements from array A into AMX register
            __m256i a_line = _mm256_loadu_si256((__m256i*) & A[i][j]);
            // load 4 long long elements from array B into AMX register
            __m256i b_line = _mm256_loadu_si256((__m256i*) & B[i][j]);

            // multiplying two AVX registers
            c_line = _mm256_mullo_epi64(a_line, b_line);
        }
    }
}

int main() {
    const unsigned int N = 4; // array size

    __int64** A = new __int64* [N]; 
    __int64** B = new __int64* [N];
    __int64** C = new __int64* [N]; 

    for (int i = 0; i < N; i++) {
        A[i] = new __int64[N];
        B[i] = new __int64[N];
        C[i] = new __int64[N];
    }

    for (int i = 0; i < N; i++) { // filling arrays
        for (int j = 0; j < N; j++) {
            A[i][j] = __int64(rand() % 10);
            B[i][j] = __int64(rand() % 10);
            C[i][j] = __int64(0);
        }
    }

    avx_mult(A, B, C, N);

    for (int i = 0; i < N; i++) {
        delete[] A[i];
        delete[] B[i];
        delete[] C[i];

    }
    delete[] A;
    delete[] B;
    delete[] C;
}

The code compiles, but the program stops on this line:

c_line = _mm256_mullo_epi64(a_line, b_line);

... with exit code 0xC000001D: Illegal Instruction exception.

The Intel Intrinsics Documentation for _mm256_mullo_epi64 says:

Synopsis

__m256i _mm256_mullo_epi64 (__m256i a, __m256i b)
#include <immintrin.h>
Instruction: vpmullq ymm, ymm, ymm
CPUID Flags: AVX512DQ + AVX512VL

Description

Multiply the packed 64-bit integers in a and b, producing intermediate 128-bit integers, and store the low 64 bits of the intermediate integers in dst.

My function arguments fit the description. Or is there some mistake?

Jan Schultke
  • 17,446
  • 6
  • 47
  • 96
hellicop11
  • 13
  • 2
  • 1
    There shouldn't be any AMX registers involved by the way, that's a different thing – harold Jun 08 '23 at 23:35
  • 1
    @Jan Schultke - nice edit, but I wouldn't have inlined the image. It isn't actually telling us anything beyond the fact that there's an Illegal Instruction exception on that source line, just confirming the question title. And it's not surprising that the AVX-512 instruction faulted, so there isn't a mystery that needs full details. (Also, the text in the image is unreadably small when inlined, so if I was inlining in another case, I'd make the image a link to the image URL, like `[ ![alt text][1] ][1]` but without the spaces. So people can click it to open it.) – Peter Cordes Jun 09 '23 at 15:12
  • @PeterCordes thanks, I'll keep that in mind for the future. I've made another edit and removed the image which was kinda unnecessary. – Jan Schultke Jun 09 '23 at 15:20

1 Answers1

4

Not every x86_64 processor supports every instruction. Namely, _mm256_mullo_epi64 requires AVX-512 extensions, and if the rest of your code works, but running this intrinsic result in executing an illegal instruction, then you're most likely running this code on a processor without AVX-512.

You can implement packed 64-bit multiplication with just AVX2 as well:

__m256i mul64_haswell (__m256i a, __m256i b) {
    __m256i bswap   = _mm256_shuffle_epi32(b,0xB1);
    __m256i prodlh  = _mm256_mullo_epi32(a,bswap);

    __m256i prodlh2 = _mm256_srli_epi64(prodlh, 32);
    __m256i prodlh3 = _mm256_add_epi32(prodlh2, prodlh);
    __m256i prodlh4 = _mm256_and_si256(prodlh3, _mm256_set1_epi64x(0x00000000FFFFFFFF));

    __m256i prodll  = _mm256_mul_epu32(a,b);
    __m256i prod    = _mm256_add_epi64(prodll,prodlh4);
    return  prod;
}

This code is taken from @PeterCordes' answer to Fastest way to multiply an array of int64_t?

Jan Schultke
  • 17,446
  • 6
  • 47
  • 96
  • I checked my processor, indeed my processor does not have the AVX-512 extension. I tested this code, but I can't figure out how it works and why we perform some operations. Could you explain how this code works? If __m256i a = [1, 4, 9, 8], __m256i b = [7, 0, 4, 8] or other integers. Having difficulty understanding _mm256_shuffle_epi32(b,0xB1); – hellicop11 Jun 10 '23 at 19:52
  • @hellicop11 This is basically "high-school multiplication" of two-digit numbers, except we use digits base 32, not base 10. `l` and `h` in the code refer to the low and high (32-bit) digit respectively. `_mm256_shuffle_epi32(b, 0xB1)` has a shuffle mask of `[2, 3, 0, 1]`, so it swaps the low and high digit of `b`. The example you have given is simple, because everything but `prodll` should be zero (at least after applying the mask `0x00..ff`). – Jan Schultke Jun 11 '23 at 11:37