I am trying to write a matrix multiplication using AVX.
The problematic snippet is the following:
for(int i=0; i<blockRowsA; i++){
    for(int j=0; j<blockColsB; j+=8){
        // Load 8 floats from myC into X
        __m256 X = _mm256_load_ps(myC.data() + i*blockColsB+j);
        for(int l=0; l<blockRowsB; l++){
            // Calculate the result
            __m256 A256 = _mm256_set1_ps(alignedBuffA[i*blockRowsA + l]);
            __m256 B256 = _mm256_load_ps(&alignedBuffB[l*blockRowsB + j]);
            X = _mm256_fmadd_ps(A256, B256, X);
        }
        // Store X back into myC
        alignas(32) float tempArray[8];
        _mm256_storeu_ps(tempArray, X);
        myC.assign(tempArray, tempArray+8);
    }
}
blockRowsA, blockRowsB and blockColsB are integers, all 600 in this case, because alignedBuffA and alignedBuffB are each 600x600.
The whole program also uses MPI with the SUMMA algorithm to split up the blocks, and my snippet runs in each iteration of the algorithm after each process has received the data it needs.
alignedBuffA and alignedBuffB are each arrays of type float*, while myC is the local block of the result matrix and is a std::vector<float> (this was already given).
However, I now have the following problems:
- I am not able to load the existing data in myC that might be there
- I might not be able to store that data back into the vector (I couldn't test this because the program throws an error before reaching that point)
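Both points can probably be addressed with the unaligned variants of the intrinsics. `_mm256_load_ps` requires a 32-byte-aligned address, which the default allocator of `std::vector<float>` does not guarantee, whereas `_mm256_loadu_ps` and `_mm256_storeu_ps` accept any address. The following is a minimal sketch rather than a definitive fix: the helper name `add8_in_place` and the `USE_AVX` macro are made up for illustration, and the `target` attribute assumes GCC or Clang (compiling with `-mavx` would work as well).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <immintrin.h>

// Assumption: GCC/Clang. Enables AVX for the annotated function only, so the
// file also builds without -mavx; with -mavx the attribute is redundant.
#if defined(__GNUC__) || defined(__clang__)
#define USE_AVX __attribute__((target("avx")))
#else
#define USE_AVX
#endif

// Hypothetical helper: adds the 8 floats at `addend` to the 8 floats of `c`
// that start at index `offset`, writing the result back in place.
USE_AVX void add8_in_place(std::vector<float>& c, std::size_t offset,
                           const float* addend) {
    assert(offset + 8 <= c.size());
    // Unaligned load: valid for any address, including vector storage.
    __m256 x = _mm256_loadu_ps(c.data() + offset);
    __m256 a = _mm256_loadu_ps(addend);
    x = _mm256_add_ps(x, a);
    // Unaligned store straight into the vector; its size stays the same.
    _mm256_storeu_ps(c.data() + offset, x);
}
```

By contrast, `myC.assign(first, last)` replaces the entire contents of the vector with the given range, so after `myC.assign(tempArray, tempArray+8)` the vector holds only 8 elements.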
For anyone asking, the algorithm I am trying to implement does the following:
- Take 8 elements from the first row of B
- Multiply each by the first element in the first row of A
- Add to that the result of multiplying the 8 elements in the second row of B by the second element in the first row of A
- Proceed until you reach the bottom of B
- Repeat for the next 8 columns of B
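The steps above amount to C += A * B computed eight columns at a time. A scalar reference version, useful for checking a vectorized kernel on small inputs, might look like the sketch below. The names `matmul_accumulate_ref`, `M`, `K` and `N` are illustrative and do not appear in the original code; they would correspond to blockRowsA, blockRowsB and blockColsB, assuming all matrices are stored row-major.

```cpp
#include <cstddef>
#include <vector>

// Scalar reference: C += A * B for row-major matrices.
// A is M x K, B is K x N, C is M x N (illustrative names).
void matmul_accumulate_ref(const std::vector<float>& A,
                           const std::vector<float>& B,
                           std::vector<float>& C,
                           std::size_t M, std::size_t K, std::size_t N) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            // Start from the previous result, as SUMMA accumulates over iterations.
            float acc = C[i * N + j];
            for (std::size_t l = 0; l < K; ++l) {
                // Row stride of A is K (its column count); row stride of B is N.
                acc += A[i * K + l] * B[l * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```

Note that the row stride of each matrix is its number of columns, which only coincides with the row counts used in the snippet because every dimension is 600.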
However, I need to read the previous results because of the SUMMA algorithm, and everything I have tried so far throws an error.
Edit:
The following code should be a "small" prototype, which sadly doesn't work because of the line that loads from myC.
In this form it is of course not useful, but including MPI in there would be too much.
#include <algorithm>
#include <cmath>
#include <fstream>
#include <iostream>
#include <iomanip>
#include <vector>
#include <immintrin.h>
#include <numeric>

void init_data(std::vector<float>& data, int rows, int cols) {
    for(int i=0; i<rows; i++)
        for(int j=0; j<cols; j++)
            data[i*cols+j] = (rows-i+j) % 4;
}

int main (int argc, char *argv[]){
    int blockRowsA = 600;
    int blockRowsB = 600;
    int blockColsB = 600;
    std::vector<float> myA(blockRowsA*blockRowsB);
    std::vector<float> myB(blockRowsB*blockColsB);
    std::vector<float> myC(blockRowsA*blockColsB);
    init_data(myA, blockRowsA, blockRowsB);
    init_data(myB, blockRowsB, blockColsB);
    // I don't know how I could read from a vector into __m256 so I used this
    float* alignedBuffA = static_cast<float*>(_mm_malloc(blockRowsA*blockRowsB * sizeof(float),32));
    float* alignedBuffB = static_cast<float*>(_mm_malloc(blockRowsB*blockColsB * sizeof(float),32));
    std::copy(myA.begin(), myA.end(), alignedBuffA);
    std::copy(myB.begin(), myB.end(), alignedBuffB);
    for(int i=0; i<blockRowsA; i++){
        for(int j=0; j<blockColsB; j+=8){
            __m256 X = _mm256_load_ps(myC.data() + i*blockColsB+j);
            for(int l=0; l<blockRowsB; l++){
                __m256 A256 = _mm256_set1_ps(alignedBuffA[i*blockRowsA + l]);
                __m256 B256 = _mm256_load_ps(&alignedBuffB[l*blockRowsB + j]);
                X = _mm256_fmadd_ps(A256, B256, X);
            }
            alignas(32) float tempArray[8];
            _mm256_storeu_ps(tempArray, X);
            std::cout << tempArray[0] << std::endl;
            myC.assign(tempArray, tempArray+8);
        }
    }
}
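A possible corrected version of the loop nest is sketched below, under assumptions the exercise does not confirm: the matrices are row-major, the column count of B is a multiple of 8 (true for 600), and the build targets GCC or Clang on a CPU with AVX and FMA. The function name `matmul_accumulate_avx`, the `USE_AVX_FMA` macro and the parameter names `M`, `K`, `N` are made up for illustration. It differs from the prototype in three ways: it uses unaligned loads and stores so that the vector's storage can be accessed directly, it writes the result to the matching position in the output instead of calling `assign`, and it uses the column counts as row strides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <immintrin.h>

// Assumption: GCC/Clang. Enables AVX and FMA for the annotated function only;
// compiling with -mavx -mfma instead makes the attribute redundant.
#if defined(__GNUC__) || defined(__clang__)
#define USE_AVX_FMA __attribute__((target("avx,fma")))
#else
#define USE_AVX_FMA
#endif

// C += A * B, row-major; A is M x K, B is K x N, C is M x N.
// Requires N to be a multiple of 8.
USE_AVX_FMA void matmul_accumulate_avx(const float* A, const float* B,
                                       std::vector<float>& C,
                                       std::size_t M, std::size_t K,
                                       std::size_t N) {
    assert(N % 8 == 0 && C.size() == M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; j += 8) {
            float* c = C.data() + i * N + j;
            // Unaligned load of the 8 previous results from the vector.
            __m256 x = _mm256_loadu_ps(c);
            for (std::size_t l = 0; l < K; ++l) {
                // Broadcast A[i][l] and multiply it with B[l][j..j+7].
                __m256 a = _mm256_set1_ps(A[i * K + l]);
                __m256 b = _mm256_loadu_ps(B + l * N + j);
                x = _mm256_fmadd_ps(a, b, x);
            }
            // Write the 8 results back to the same place in C.
            _mm256_storeu_ps(c, x);
        }
    }
}
```

With the names from the prototype, the call would be `matmul_accumulate_avx(myA.data(), myB.data(), myC, blockRowsA, blockRowsB, blockColsB)`, in which case the separately allocated aligned buffers are no longer needed.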
Edit 2:
This is part of an exercise in college where we specifically got access to a server for this.