
I am trying to write a matrix multiplication using AVX.
The problematic snippet is the following:

for(int i=0; i<blockRowsA; i++){
    for(int j=0; j<blockColsB; j+=8){
        // Load 8 floats from myC into X
        __m256 X = _mm256_load_ps(myC.data() + i*blockColsB+j);

        for(int l=0; l<blockRowsB; l++){
            // Calculate the result
            __m256 A256 = _mm256_set1_ps(alignedBuffA[i*blockRowsA + l]);
            __m256 B256 = _mm256_load_ps(&alignedBuffB[l*blockRowsB + j]);

            X = _mm256_fmadd_ps(A256, B256, X);
        }
        // Store X back into myC
        alignas(32) float tempArray[8];
        _mm256_storeu_ps(tempArray, X);
        myC.assign(tempArray, tempArray+8);
    }
}

blockRowsA, blockRowsB and blockColsB are integers, in this case 600 because alignedBuffA and alignedBuffB are each 600x600.
The whole program also uses MPI with the SUMMA algorithm to split up the blocks, and my snippet runs in each iteration of the algorithm after each process has received the data it needs.
alignedBuffA and alignedBuffB are each a float* array, while myC is the local block of the result matrix and is a std::vector<float> (this was already given).

However, I now have the following problems:

  1. I am not able to load the existing data in myC that might be there
  2. I might not be able to store that data back into the vector (couldn't test because the program throws an error before)

For anyone asking, the algorithm I am trying to implement does the following:

  1. Take 8 elements from the first row of B
  2. Multiply each by the first element of the first row of A
  3. Add to that the result of multiplying 8 elements from the second row of B by the second element of the first row of A
  4. Proceed until you reach the bottom of B
  5. Repeat for the next 8 columns of B

However, because of the SUMMA algorithm I need to read the previous results back in, and everything I have tried throws an error.

Edit:
The following code is a "small" prototype that sadly fails at the line that loads from myC.
On its own it is of course not useful, but including MPI would be too much.

#include <cmath>
#include <fstream>
#include <iostream>
#include <iomanip>
#include <vector>

#include <immintrin.h>
#include <numeric>

void init_data(std::vector<float>& data, int rows, int cols) {
    for(int i=0; i<rows; i++)
        for(int j=0; j<cols; j++)
            data[i*cols+j] = (rows-i+j) % 4;
}

int main (int argc, char *argv[]){
    int blockRowsA = 600;
    int blockRowsB = 600;
    int blockColsB = 600;

    std::vector<float> myA(blockRowsA*blockRowsB);
    std::vector<float> myB(blockRowsB*blockColsB);
    std::vector<float> myC(blockRowsA*blockColsB);

    init_data(myA, blockRowsA, blockRowsB);
    init_data(myB, blockRowsB, blockColsB);

    // I don't know how I could read from a vector into __m256 so I used this
    float* alignedBuffA = static_cast<float*>(_mm_malloc(blockRowsA*blockRowsB * sizeof(float),32));
    float* alignedBuffB = static_cast<float*>(_mm_malloc(blockRowsB*blockColsB * sizeof(float),32));
    std::copy(myA.begin(), myA.end(), alignedBuffA);
    std::copy(myB.begin(), myB.end(), alignedBuffB);

    for(int i=0; i<blockRowsA; i++){
        for(int j=0; j<blockColsB; j+=8){
            __m256 X = _mm256_load_ps(myC.data() + i*blockColsB+j);

            for(int l=0; l<blockRowsB; l++){
                __m256 A256 = _mm256_set1_ps(alignedBuffA[i*blockRowsA + l]);
                __m256 B256 = _mm256_load_ps(&alignedBuffB[l*blockRowsB + j]);

                X = _mm256_fmadd_ps(A256, B256, X);
            }
            alignas(32) float tempArray[8];
            _mm256_storeu_ps(tempArray, X);
            std::cout << tempArray[0] << std::endl;
            myC.assign(tempArray, tempArray+8);
        }
    }
}

Edit 2:
This is part of an exercise in college where we specifically got access to a server for this.

  • Possible duplicates: https://www.google.com/search?q=stackoverflow+c%2B%2B+matrix+multiplication+avx&oq=stackoverflow+c%2B%2B+matrix+multiplication+avx&aqs=edge..69i57j69i60.14854j0j1&sourceid=chrome&ie=UTF-8 – Thomas Matthews Jun 06 '23 at 20:32
  • @ThomasMatthews I already looked at a lot of that but everything was using a different approach where you transpose the second matrix – Mikecraft1224 Jun 06 '23 at 20:35
  • questions are much easier to answer when they do contain a question. And when the question is about an error in code you should try to provide a [mcve] together with the error – 463035818_is_not_an_ai Jun 06 '23 at 20:36
  • @463035818_is_not_a_number Well the whole thing is like 250 lines of code and some other file but I can try to make a minimal version, I'll update then – Mikecraft1224 Jun 06 '23 at 20:38
  • [mcve] != the whole thing – 463035818_is_not_an_ai Jun 06 '23 at 20:39
  • @463035818_is_not_a_number I've now included a small example The problem is, that the whole thing is running on a server that doesn't return real errors and just stuff like srun: error: z0436: task 0: Segmentation fault (core dumped) srun: launch/slurm: _step_signal: Terminating StepId=13588681.0 – Mikecraft1224 Jun 06 '23 at 21:18
  • You can read from a vector directly, but it may not be sufficiently aligned to safely use aligned loads. You can also store directly, but also with unaligned stores. Can you run a version of this locally, so you can debug? E: the potential lack of alignment applies to `_mm256_load_ps(myC.data() + i*blockColsB+j)` for example. – harold Jun 06 '23 at 21:23
  • If you want to learn AVX intrinsics so you can vectorize other things, instead of just using a matmul library like Eigen, you need a development environment where you can single-step with a debugger. If your own CPU isn't an x86-64 with AVX+FMA, you can use an emulator while debugging for correctness. Also, if you want to use `std::vector`, use it with an aligned allocator (2nd template parameter, google it). Or use unaligned loads like `_mm256_loadu_ps` until / unless you get around to aligning your allocations. Copying to/from aligned buffers wastes a lot of CPU time. – Peter Cordes Jun 06 '23 at 21:51
  • @harold Sadly I am not able to run it locally to debug, that's one of the big problems. – Mikecraft1224 Jun 06 '23 at 22:04
  • @PeterCordes The emulating sounds interesting maybe I'll try that somewhat, I've also added another edit to explain the situation, because this is part of a college exercise – Mikecraft1224 Jun 06 '23 at 22:07
  • [Intel AVX intrinsics: any compatibility library out?](https://stackoverflow.com/q/2708501) / [How to test AVX-512 instructions w/o supported hardware?](https://stackoverflow.com/q/51805127) (also works for AVX1). IDK if this would work on an AArch64 Mac or something, but probably yes, under rosetta it would act like it was running on an x86-64-v2 and be able to emulate. – Peter Cordes Jun 06 '23 at 22:11

0 Answers