I am trying to get a working example of multiplying two matrices using SIMD, because I need to compare its running time against a "normal" (scalar) implementation. That is why I tried the code from "Efficient 4x4 matrix multiplication (C vs assembly)".

#include <xmmintrin.h>
#include <stdio.h>


void M4x4_SSE(float *A, float *B, float *C) {
    __m128 row1 = _mm_load_ps(&B[0]);
    __m128 row2 = _mm_load_ps(&B[4]);
    __m128 row3 = _mm_load_ps(&B[8]);
    __m128 row4 = _mm_load_ps(&B[12]);
    for(int i=0; i<4; i++) {
        __m128 brod1 = _mm_set1_ps(A[4*i + 0]);
        __m128 brod2 = _mm_set1_ps(A[4*i + 1]);
        __m128 brod3 = _mm_set1_ps(A[4*i + 2]);
        __m128 brod4 = _mm_set1_ps(A[4*i + 3]);
        __m128 row = _mm_add_ps(
                    _mm_add_ps(
                        _mm_mul_ps(brod1, row1),
                        _mm_mul_ps(brod2, row2)),
                    _mm_add_ps(
                        _mm_mul_ps(brod3, row3),
                        _mm_mul_ps(brod4, row4)));
        _mm_store_ps(&C[4*i], row);
    }
}


int main(){

  float A[4] __attribute__((aligned(16))) = {1,2,3,4};
  float B[4] __attribute__((aligned(16))) = {5,6,7,8};
  float C[4] __attribute__((aligned(16)));

  M4x4_SSE(A,B,C);

}

I am not familiar with C or C++, so this has been difficult. I get:

*** stack smashing detected ***: ./prueba terminated
Aborted (core dumped)

when I run my program. I also need to scale this up to at least a 500x500 matrix. Thanks.

Lvargas

1 Answer

The arrays you declare in main have 4 elements each, but your multiplication code reads and writes 16 elements of each. Writing past the allocated space (elements 4 and later of C, starting in the second iteration of your i loop) clobbers the stack, which results in the error you see.
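As a minimal sketch of that fix, the version below keeps the kernel from the question and only changes the arrays to hold 16 floats each. The helper name `print_4x4_example` and the sample values (A holds 1 to 16, B is the identity matrix) are assumptions made for illustration, and the `__attribute__((aligned(16)))` syntax assumes GCC or Clang, as in the question.

```c
#include <xmmintrin.h>
#include <stdio.h>

/* Kernel from the question: C = A * B for row-major 4x4 matrices.
   Each pointer must reference 16 floats on a 16-byte boundary,
   because _mm_load_ps and _mm_store_ps require aligned addresses. */
void M4x4_SSE(const float *A, const float *B, float *C) {
    __m128 row1 = _mm_load_ps(&B[0]);
    __m128 row2 = _mm_load_ps(&B[4]);
    __m128 row3 = _mm_load_ps(&B[8]);
    __m128 row4 = _mm_load_ps(&B[12]);
    for (int i = 0; i < 4; i++) {
        /* Broadcast each element of row i of A across a vector. */
        __m128 brod1 = _mm_set1_ps(A[4*i + 0]);
        __m128 brod2 = _mm_set1_ps(A[4*i + 1]);
        __m128 brod3 = _mm_set1_ps(A[4*i + 2]);
        __m128 brod4 = _mm_set1_ps(A[4*i + 3]);
        /* Row i of C is a linear combination of the rows of B. */
        __m128 row = _mm_add_ps(
                    _mm_add_ps(
                        _mm_mul_ps(brod1, row1),
                        _mm_mul_ps(brod2, row2)),
                    _mm_add_ps(
                        _mm_mul_ps(brod3, row3),
                        _mm_mul_ps(brod4, row4)));
        _mm_store_ps(&C[4*i], row);
    }
}

/* Hypothetical helper (not in the question): multiplies a sample
   matrix by the identity and prints the product, which equals A. */
void print_4x4_example(void) {
    /* 16 elements per array: one full 4x4 matrix each. */
    float A[16] __attribute__((aligned(16))) = {
         1,  2,  3,  4,
         5,  6,  7,  8,
         9, 10, 11, 12,
        13, 14, 15, 16};
    float B[16] __attribute__((aligned(16))) = {
        1, 0, 0, 0,
        0, 1, 0, 0,
        0, 0, 1, 0,
        0, 0, 0, 1};
    float C[16] __attribute__((aligned(16)));

    M4x4_SSE(A, B, C);

    for (int i = 0; i < 4; i++) {
        printf("%g %g %g %g\n",
               C[4*i + 0], C[4*i + 1], C[4*i + 2], C[4*i + 3]);
    }
}
```

Calling `print_4x4_example()` from `main` should run without the stack-smashing abort, since every load and store now stays inside a 16-element array.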

1201ProgramAlarm
  • I think I'm having trouble understanding the code. What should a proper input be? – Lvargas Jun 29 '16 at 04:29
  • @Lvargas a 4x4 matrix has 16 elements. Your float arrays in main should be, for example, `float C[16];` with more initializers for `A` and `B`. Alternatively, you can use `float C[4][4]` to get a `C[row][col]` notation. – 1201ProgramAlarm Jun 29 '16 at 04:36
  • @Lvargas, use `float A[16], B[16], C[16];`. But if you have to ask this question, then it's too early for you to be using intrinsics. Start with standard C code. – Z boson Jun 29 '16 at 06:10
  • Thanks! Do you know if the only way to scale this is to create more vectors? If I want to multiply 100x100 matrices, how can I scale this? – Lvargas Jun 29 '16 at 14:13
  • @Lvargas That is not a trivial change. The 4x4 code features several optimizations, including loop unrolling. If you're going to be using 100x100 or 500x500 matrices, you are probably better off using an existing matrix library that can handle those large sizes efficiently. – 1201ProgramAlarm Jun 30 '16 at 05:34
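
For the timing comparison the question asks about, one possible starting point is sketched below: a plain scalar multiply next to a straightforward SSE version for square n x n matrices. The function names `matmul_scalar` and `matmul_sse` are hypothetical, the SSE version assumes n is a multiple of 4 (100 and 500 both are), and it uses unaligned loads and stores so that ordinary `malloc`ed buffers work. It applies no blocking or unrolling, so it is only a baseline and not a substitute for the tuned matrix library recommended in the comments.

```c
#include <xmmintrin.h>
#include <stddef.h>

/* Scalar reference: C = A * B for row-major n x n matrices. */
void matmul_scalar(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++) {
                sum += A[i*n + k] * B[k*n + j];
            }
            C[i*n + j] = sum;
        }
    }
}

/* SSE sketch: computes four adjacent columns of one row of C at a time.
   Assumption: n is a multiple of 4. Unaligned loads/stores are used so
   the buffers need no special alignment. */
void matmul_sse(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j += 4) {
            __m128 acc = _mm_setzero_ps();
            for (size_t k = 0; k < n; k++) {
                /* Broadcast A[i][k] and multiply it by B[k][j..j+3]. */
                __m128 a = _mm_set1_ps(A[i*n + k]);
                __m128 b = _mm_loadu_ps(&B[k*n + j]);
                acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
            }
            _mm_storeu_ps(&C[i*n + j], acc);
        }
    }
}
```

Both functions take the same arguments, so they can be called on identical inputs and timed with whatever clock is available (for example `clock()` from `<time.h>`), after checking that their outputs agree.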