2

I inspired myself from this link to code a multiplicator of matrix which are multiple of 4: SSE matrix-matrix multiplication

I came up with something somewhat similar, but I observed that if the for loop with j increase by 4 like in the suggest code, it only fill 1 column each 4 column ( which make sense). I can decrease the for loop by 2, and the result is that only half of the column are filled.

So logically, the solution should be to only increase the loop by 1, but when I make the change in the code, I get either segfault error if I use_mm_store_ps or data corrupted size vs. prev_size if I use _mm_storeu_ps, which makes me believe that the data is simply not align.

What and how should I align the data to not cause such error and fill the resulting matrix?

Here is the code I have so far:

void mat_mult(Matrix A, Matrix B, Matrix C, n) {
   for(int i = 0; i < n; ++i) {
   for(int j = 0; j < n; j+=1) {   
       __m128 vR = _mm_setzero_ps();
       for(int k = 0; k < n; k++) {  
           __m128 vA = _mm_set1_ps(A(i,k));  
           __m128 vB = _mm_loadu_ps(&B(k,j));  
           vR = _mm_add_ss(vR,vA*vB);
       }
       _mm_storeu_ps(&C(i,j), vR);
   }
 }
}
phuclv
  • 37,963
  • 15
  • 156
  • 475
  • How is the `Matrix` class defined? Please provide a [mcve]. – Nate Eldredge Nov 21 '21 at 20:53
  • I can't post the whole class here but pretty much the data is stored in a float pointer which has been aligned with ```static_cast(aligned_alloc(64, n * n * sizeof(float)))```. Plus there is operator for ```()``` to access element of the matrix. – NearFinished_CS Nov 21 '21 at 20:54
  • Always use a debugger to see exactly where you segfault. If it's on a `movaps` instruction (or a legacy-SSE math instruction with a memory source operand, like `addps xmm0, [rdi]`), then alignment is a possible problem. Otherwise it's not, like when you corrupt malloc bookkeeping info by storing outside your array, and it faults somewhere else. – Peter Cordes Nov 22 '21 at 03:46
  • You can't use SIMD on a matrix multiply unless you transpose the second matrix. Think about it. When you pull four consecutive values from matrix 1, the corresponding four elements you need from matrix 2 are VERTICAL. They are not contiguous. Since the cost of transposing is usually pretty high, that makes this a poor choice for paralellization. – Tim Roberts Nov 22 '21 at 09:19
  • @TimRoberts: Or unless you re-structure the loop to run along rows of both matrices *and += into memory in the destination*, instead of trying to accumulate one vector of row * column dot products. That's one well-known strategy, shown for example in an appendix in [What Every Programmer Should Know About Memory?](https://stackoverflow.com/a/8126441) (along with a cache-blocked version). It's not great that it requires 3 loads and a store for every FMA, but that can be amortized slightly. Your answer about writing past the end of the array was likely correct; I had upvoted for that. – Peter Cordes Nov 22 '21 at 09:27
  • looks like no one noticed it. `vA*vB` haha – Алексей Неудачин Nov 22 '21 at 09:53
  • @АлексейНеудачин: `va * vB` works in GNU C, where `__m128` is defined as a GNU C native vector. For `__m128` types, it's exactly equivalent to `_mm_mul_ps(vA, vB)`. So if it compiles at all, it works, it's simply not portable to MSVC. The other mainstream x86 compilers all support GNU extensions. The bug here is `_mm_add_ss` which only adds the low element. – Peter Cordes Nov 22 '21 at 12:09

1 Answers1

1

I corrected your code, also implemented quite a lot of other supplementary code to fully run tests and print outputs, including that I needed to implement Matrix class from scratch. Following code can be compiled in C++11 standard.

Main corrections to your function are: you should handle separately a case when number of B columns is not multiple of 4, this uneven tail case should be handled by separate loop, you should actually run j loop in steps of 4 (as 128-bit SSE float-32 register contains 4 floats), you should use _mm_mul_ps(vA, vB) instead of vA * vB.

Main bug of your code is that instead of yours _mm_add_ss() you should use _mm_add_ps() because you need to add not single value but 4 of them separately. Only due to usage of _mm_add_ss() you were observed that only 1 out of 4 columns was filled (the rest 3 were zeros).

Alternatively you can fix work of your code by using _mm_load_ss() instead of _mm_loadu_ps() and _mm_store_ss() instead _mm_storeu_ps(). After only this fix your code will give correct result, but will be slow, it will be not faster than regular non-SSE solution. To actually gain speed you have to use only ..._ps() instructions everywhere, also handle correctly case of non-multiple of 4.

Because you don't handle case of B columns being non-multiple of 4, because of this your program segfaults, you just store memory out of bounds of matrix C.

Also you asked a question about alignment. Don't ever use aligned store/load like _mm_store_ps()/_mm_load_ps(), always use _mm_storeu_ps()/_mm_loadu_ps(). Because unaligned access instructions are guaranteed to be of same speed as aligned access instructions for same memory pointers values. But aligned instructions may segfault. So unaligned is always better, same speed and never segfault. It used to be in old time on old CPUs that aligned instructions where faster, but right now they are implemented in CPU with exactly same speed. Aligned instructions don't give any profit, only segfaults. But still you may want to use aligned instructions to intentionally segfault if you want to make sure that your program's memory pointers are always aligned.

I implemented also a separate function with reference slow multiplication of matrices, in order to run a reference test to check the correctness of fast (SSE) multiplication.

As commented out by @АлексейНеудачин, my previous version of Matrix class was allocating unaligned memory for array, now I implemented new helper class AlignmentAllocator which ensures that Matrix is allocating aligned memory, this allocator is used by std::vector<> that stores underlying Matrix's data.

Full code with all the corrections, tests and console outputs plus all the extra supplementary code is below. See also console output after the code, I do print two matrices produced by two different multiplication functions, so that two matrices can be compared visually. All test cases are generated randomly. Scroll down my code a bit to see your fixed function mat_mult(). Also click on Try it online! link if you want to see/run my code online.

Try it online!

#include <cmath>
#include <iostream>
#include <vector>
#include <random>
#include <stdexcept>
#include <string>
#include <iomanip>
#include <cstdlib>

#include <malloc.h>
#include <immintrin.h>

using FloatT = float;

template <typename T, std::size_t N>
class AlignmentAllocator {
public:
    typedef T value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    typedef T * pointer;
    typedef const T * const_pointer;
    typedef T & reference;
    typedef const T & const_reference;

public:
    inline AlignmentAllocator() throw() {}
    template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) throw() {}
    inline ~AlignmentAllocator() throw() {}
    inline pointer adress(reference r) { return &r; }
    inline const_pointer adress(const_reference r) const { return &r; }
    inline pointer allocate(size_type n);
    inline void deallocate(pointer p, size_type);
    inline void construct(pointer p, const value_type & v) { new (p) value_type(v); }
    inline void destroy(pointer p) { p->~value_type(); }
    inline size_type max_size() const throw() { return size_type(-1) / sizeof(value_type); }
    template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
    bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
    bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
};

template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
    #ifdef _MSC_VER
        auto p = (pointer)_aligned_malloc(n * sizeof(value_type), N);
    #else
        auto p = (pointer)aligned_alloc(N, n * sizeof(value_type));
    #endif
    if (!p)
        throw std::bad_alloc();
    return p;
}

template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
    #ifdef _MSC_VER
        _aligned_free(p);
    #else
        std::free(p);
    #endif
}

static size_t constexpr MatrixAlign = 64;

template <typename T, size_t Align = MatrixAlign>
using AlignedVector = std::vector<T, AlignmentAllocator<T, Align>>;

class Matrix {
public:
    Matrix(size_t rows, size_t cols)
        : rows_(rows), cols_(cols) {
        cols_aligned_ = (sizeof(FloatT) * cols_ + MatrixAlign - 1)
            / MatrixAlign * MatrixAlign / sizeof(FloatT);
        Clear();
        if (size_t(m_.data()) % 64 != 0 ||
                (cols_aligned_ * sizeof(FloatT)) % 64 != 0)
            throw std::runtime_error("Matrix was allocated unaligned!");
    }
    Matrix & Clear() {
        m_.clear();
        m_.resize(rows_ * cols_aligned_);
        return *this;
    }
    FloatT & operator() (size_t i, size_t j) {
        if (i >= rows_ || j >= cols_)
            throw std::runtime_error("Matrix index (" +
            std::to_string(i) + ", " + std::to_string(j) + ") out of bounds (" +
            std::to_string(rows_) + ", " + std::to_string(cols_) + ")!");
        return m_[i * cols_aligned_ + j];
    }
    FloatT const & operator() (size_t i, size_t j) const {
        return const_cast<Matrix &>(*this)(i, j);
    }
    size_t Rows() const { return rows_; }
    size_t Cols() const { return cols_; }
    bool Equal(Matrix const & b, int round = 7) const {
        if (Rows() != b.Rows() || Cols() != b.Cols())
            return false;
        FloatT const eps = std::pow(FloatT(10), -round);
        for (size_t i = 0; i < Rows(); ++i)
            for (size_t j = 0; j < Cols(); ++j)
                if (std::fabs((*this)(i, j) - b(i, j)) > eps)
                    return false;
        return true;
    }
private:
    size_t rows_ = 0, cols_ = 0, cols_aligned_ = 0;
    AlignedVector<FloatT> m_;
};

void mat_print(Matrix const & A, int round = 7, size_t width = 0) {
    FloatT const pow10 = std::pow(FloatT(10), round);
    for (size_t i = 0; i < A.Rows(); ++i) {
        for (size_t j = 0; j < A.Cols(); ++j)
            std::cout << std::setprecision(round) << std::fixed << std::setw(width)
                << std::right << (std::round(A(i, j) * pow10) / pow10) << " ";
        std::cout << std::endl;;
    }
}

void mat_mult(Matrix const & A, Matrix const & B, Matrix & C) {
    if (A.Cols() != B.Rows())
        throw std::runtime_error("Number of A.Cols and B.Rows don't match!");
    if (A.Rows() != C.Rows() || B.Cols() != C.Cols())
        throw std::runtime_error("Wrong C rows, cols!");
    
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = 0; j < B.Cols() - B.Cols() % 4; j += 4) {
            auto sum = _mm_setzero_ps();
            for (size_t k = 0; k < A.Cols(); ++k)
                sum = _mm_add_ps(
                    sum,
                    _mm_mul_ps(
                        _mm_set1_ps(A(i, k)),
                        _mm_loadu_ps(&B(k, j))
                    )
                );
            _mm_storeu_ps(&C(i, j), sum);
        }
        
    if (B.Cols() % 4 == 0)
        return;
        
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = B.Cols() - B.Cols() % 4; j < B.Cols(); ++j) {
            FloatT sum = 0;
            for (size_t k = 0; k < A.Cols(); ++k)
                sum += A(i, k) * B(k, j);
            C(i, j) = sum;
        }
}

void mat_mult_slow(Matrix const & A, Matrix const & B, Matrix & C) {
    if (A.Cols() != B.Rows())
        throw std::runtime_error("Number of A.Cols and B.Rows don't match!");
    if (A.Rows() != C.Rows() || B.Cols() != C.Cols())
        throw std::runtime_error("Wrong C rows, cols!");
        
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = 0; j < B.Cols(); ++j) {
            FloatT sum = 0;
            for (size_t k = 0; k < A.Cols(); ++k)
                sum += A(i, k) * B(k, j);
            C(i, j) = sum;
        }
}

void mat_fill_random(Matrix & A) {
    std::mt19937_64 rng{std::random_device{}()};
    std::uniform_real_distribution<FloatT> distr(-9.99, 9.99);
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = 0; j < A.Cols(); ++j)
            A(i, j) = distr(rng);
}

int main() {
    try {
        {
            Matrix a(17, 23), b(23, 19), c(17, 19), d(c.Rows(), c.Cols());
            
            mat_fill_random(a);
            mat_fill_random(b);
            mat_mult_slow(a, b, c);
            
            mat_mult(a, b, d);
            
            if (!c.Equal(d, 5))
                throw std::runtime_error("Test failed, c != d.");
        }
                
        {
            Matrix a(3, 7), b(7, 5), c(3, 5), d(c.Rows(), c.Cols());
            
            mat_fill_random(a);
            mat_fill_random(b);
            mat_mult_slow(a, b, c);
            
            mat_mult(a, b, d);
            
            mat_print(c, 3, 8);
            std::cout << std::endl;
            mat_print(d, 3, 8);
        }
            
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

 -37.177 -114.438   36.094  -49.689 -139.857 
  22.113 -127.210  -94.434  -14.363   -6.336 
  71.878   94.234   33.372   32.573   73.310 

 -37.177 -114.438   36.094  -49.689 -139.857 
  22.113 -127.210  -94.434  -14.363   -6.336 
  71.878   94.234   33.372   32.573   73.310 
Arty
  • 14,883
  • 6
  • 36
  • 69
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/239593/discussion-on-answer-by-arty-simd-matrix-multiplication-causing-segfault-or-sega). – Machavity Nov 26 '21 at 03:27