Multithreaded matrix by matrix multiplication

Question

I'm trying to create application like in title. Generally, almost everything work fine, except multithreading in ASM. When I want to use multithreading during multiply, sometimes some elements from the first column in the first row in the result matrix is 0. I was trying everything and I can't find my mistake, so I want to ask your help. There is code written in c++:

void multiply(int* resultRow, int* row, int* column, int size)

    {
        int* startCol = column;
        int* start = row;
        for (int i = 0; i < size; i++)
        {
            column = startCol;
            column += i;
            (*resultRow) = 0;
            row = start;
            for (int j = 0; j < size; j++)
            {
                (*resultRow) += ((*row) * (*column));
                row++;
                column += size;
            }
            resultRow++;
        }
    }

This function works fine even with multithreading(resultRow is address of i-th row in matrix, row and columns are addresses of exact row and column in matrices I want to multiply) There is an ASM code:

         .CODE

;-------------------------------------------------------------------------
;-------------------------------------------------------------------------

AsmMultiplication PROC loopCount: qword, secondLoopCount: qword, startColAddress : qword, startRowAddress : qword, count : qword, matrixSize : qword                                                                                                                                                               
                        ; resultRow in RCX
                        ; rowToMultiply in RDX
                        ; colToMultiply in R8
                        ; size int R9
mov matrixSize, R9
mov loopCount, R9
mov secondLoopCount, R9
mov count, 0

mov R10, RDX
mov R9, RCX

mov startColAddress, R8
mov startRowAddress, R10

loop1:
mov R8, startColAddress             ; column = startColAddress
mov R10, startRowAddress            ; row = startRowAddress

mov RAX, count                      ; |
mov RCX, 4                          ; |
mul RCX                             ; |
add R8, RAX                         ; | column += i

xor RAX, RAX                        ; |
mov [R9], RAX                       ; (*resultRow) = 0

mov RAX, matrixSize                 ; |
mov loopCount, RAX                  ; |
pxor xmm2, xmm2                     ; | preparing for multiplying in loop2
inc count
    
            loop2:
            movq xmm0, qword ptr [R10]          ;move actual row element to vector
            movq xmm1, qword ptr [R8]           ;move actual column element to vector
            pmuludq xmm0, xmm1                  ;multiply vectors
            paddq xmm2, xmm0                    ;add result to third vector

            add R10, 4                          ; row++

            mov RAX, matrixSize                 ; |
            mov RDX, 4                          ; |
            mul RDX                             ; |
            add R8, RAX                         ; | column += size

            mov RDX, loopCount                  ; |
            dec RDX                             ; | decrementing loop counter
            mov loopCount, RDX                  ; | 
            jnz loop2                           ; | if loopCount == 0 break

movq RAX, xmm2                      ; |
mov [R9], RAX                       ; | resultRow = rows * columns
add R9, 4                           ; | resultRows++

mov RDX, secondLoopCount            ; |
dec RDX                             ; |
mov secondLoopCount, RDX            ; |
jnz loop1                           ; | if secondLoopCount == 0 break


ret
AsmMultiplication ENDP


end

There how im using multithreading:

    public void ThreadedFunction(int size, int rows)
            {
                unsafe
                {
                    fixed (int* resultRow = &m3.matrix[rows, 0])
                    fixed (int* rowToMultiply = &m1.matrix[rows, 0])
                    fixed (int* colToMultiply = &m2.matrix[0, 0])
                        if(Asm == false)
                        {
                            MatrixMultiplication.App.multiply(resultRow, rowToMultiply, colToMultiply, size);
                        }
                        else
                        {
                            MatrixMultiplication.App.AsmMultiplication(resultRow, rowToMultiply, colToMultiply, size);
                        }
                        
                }
            }
    ...

 for (int i = 0; i < threadsCount; i++)
                {
                    threads[i] = this.StartTheThread(size, rows);
                    rows++;

There is a simple result matrix in .txt file with multithreading: Correct: enter image description here

and not correct: enter image description here

i have no idea why sometimes output is correct and sometimes not, but that mistake is only i first row at result matrix. Could anyone explain whats wrong? I know im using in n threads the same rows and columns in matrixes but why then c++ code is working fine ?

Well, the first thing that jumps out at me is you aren't [saving the nonvolatile registers](https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170#callercallee-saved-registers). — David Wohlferd, Nov 12 '21 at 21:50
Can i just change that RDI register on any volatile register? — Tom5821, Nov 12 '21 at 22:05
Changing rdi to (say) r11 would probably help. But there's also rsi. While you might be able to squeeze out another register somewhere, using push/pop might be an option as well. — David Wohlferd, Nov 12 '21 at 22:07
Is using RSI a problem when i keep the same value there all time? — Tom5821, Nov 12 '21 at 22:33
The problem you are trying to solve is that the code that calls you may have put a value in rsi. When your function returns, it has every right to assume the value it put there will still be there. — David Wohlferd, Nov 12 '21 at 22:34
I have changed these registers, but still have the same problem ;/ — Tom5821, Nov 12 '21 at 22:37
When you call this multithreaded, does each thread get their own `resultRow`? — David Wohlferd, Nov 12 '21 at 22:41
Exactly, im passing pointer on resultmatrix[row, 0] on each thread, also im passing pointer on matrix[0,0] where im iterating through rows and pointer on matrix[row, 0] where im iterating through columns — Tom5821, Nov 12 '21 at 22:47
I thought that in one moment two threads want to read from the same memory cell because sometimes that code works and sometimes it does not. — Tom5821, Nov 12 '21 at 22:51
I don't see any multi-threading; are you talking about SIMD `pmuludq`? About that, it's probably a bad idea to be doing 64-bit loads/stores around a 32 x 32 => 64-bit multiply on 32-bit `int` data; look at exactly what https://www.felixcloutier.com/x86/pmuludq does. If you're going to use SIMD at all, SSE4.1 `pmulld` does packed 32x32 => 32-bit multiply. But since you're only doing one element at once it seems just use scalar 32-bit registers like `mov eax, [R10]` / `imul eax, [r8]`. — Peter Cordes, Nov 13 '21 at 03:19
Also, don't use `mul` to multiply by 4, especially not inside the loop. Calculate the row stride in some other reg, like RDX, with `mov rdx, r9` / `shl rdx, 2`, or `lea rdx, [r9*4]`. Never use `mul` unless you want the full 128-bit `rdx:rax` result of 64x64 multiplication. — Peter Cordes, Nov 13 '21 at 03:22
(BTW, if you were actually going to use SIMD, you'd want to use a different loop structure so you can do contiguous loads from rows, instead of trying to do strided loads down a column. See for example the SIMD FP matmul in [What Every Programmer Should Know About Memory?](https://stackoverflow.com/q/8126311)/). But anyway, before you even think about multi-threading, sort out your code for a single thread first to use the right element-size (32-bit dword = `int`) for scalar loads/stores without overlapping onto the next element when storing. — Peter Cordes, Nov 13 '21 at 04:53
@PeterCordes I put how i use multithreading in my question now. — Tom5821, Nov 13 '21 at 10:59
I have changed pmuludq on pmulld and its almost working... now sometimes there is a difference by 1 in element from first column in result matrix for example should be 450 but there is 449. — Tom5821, Nov 13 '21 at 11:07
i think i have solved my problem. There: `movq RAX, xmm2 mov [R9], RAX ` i should do `mov [R9], EAX` Now it works. Thanks everyone for attention and help. — Tom5821, Nov 13 '21 at 12:39
Or just do `movd [r9], xmm0`, but yes like I said, dword store. Bouncing through an integer register is just a waste of an instruction. (Of course there's zero point in using XMM0 for one scalar element in the first place, like I said you should just use `imul`. But if you insist on XMM for scalar, use `movd` loads and `pmuludq`; `pmulld` is more expensive on Intel since Haswell, and pointless if you don't care about the results other than the low dword of the low element. It's only worth using if you actually are going to take advantage of 4 packed 32x32 => 32-bit multiplies.) — Peter Cordes, Nov 13 '21 at 14:54
If you're actually doing this for performance reasons, you'd get more efficient asm than this out of a C compiler with optimization enabled. But for large matrices, naive matmul is terrible for cache locality, so there's probably even more to gain from cache-blocking and other single-thread optimizations than from multithreading. Threading can help, but I'd recommend getting it efficient (and fully correct, with the right width of loads) on a single core first. — Peter Cordes, Nov 13 '21 at 15:45
@PeterCordes i have to use some vector instructions. I will be trying multiply 2 elements at once in xmm and just move those elements into two elements of result row. This option would be better than using vector instructions to multiply just one element in xmm. — Tom5821, Nov 13 '21 at 21:15
About multithreading - I noticed that multithreaded multiplying is not as fast as single thread multiplying (I wasn’t testing extremely big matrices, max 1000 x 1000). About algorithm - I’m not looking as much on performance but the way I wanted to multiply matrices + I’m not a asm master as you can see in my code :D(in actual code I have changed everything you suggested, thank you for teaching me something new) — Tom5821, Nov 13 '21 at 21:21
The main purpose of my program is to compare performance we can get using asm( that’s why I have the same function in C++ dll, to compare how much time takes multiplying in c++ and asm). — Tom5821, Nov 13 '21 at 21:29
If you're going to use SIMD, it doesn't make sense to only do 2 elements at once. Your sources and destination are all 32-bit integers, so there's no need for widening multiply. And `pmuludq` doesn't take the low two 32-bit elements of the source so you can't feed it with just `movq`. Also, if you want to stride down 4 columns at once (with SIMD), you need to broadcast each row element separately so every set of 4x column elements gets multiplied by each row element. — Peter Cordes, Nov 13 '21 at 21:33
The speed of a matmul depends heavily on implementation quality. Especially in asm, you'd be comparing *your skill* at hand-optimizing asm vs. a compiler if you use the same algorithm (with intrinsics like `_mm_loadu_si128` to use SIMD in C++), not the performance of asm as a language. That's fine as long as you realize that's what you're doing. The recommended way to use SIMD these days is intrinsics, not hand-written asm, especially for x86 where compilers do a good job. — Peter Cordes, Nov 13 '21 at 21:37
If you use optimized compiler output as a starting point for your optimization effort, you can't do worse than a C++ compiler as long as you benchmark to make sure any changes didn't make it worse. See also [Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?](https://stackoverflow.com/q/40354978)/ — Peter Cordes, Nov 13 '21 at 21:37
Okey, I think I get it but that program is a school project and some conditions like multithreading or vector instructions are ‘must have’ in my case. I would like to back to the vector instructions - I wanted to fill first xmm for example with 2 elements of row and second xmm fill with 2 elements of column and these elements multiply then send them to 2 next elements in result matrix, it is possible? — Tom5821, Nov 13 '21 at 21:49

Multithreaded matrix by matrix multiplication

0 Answers0