2

It would be very nice if you could help me clarify some details about GPU performance, because I have been stuck on this for several weeks. I apologize for my poor English, but I will do my best to explain the problem.

So, about my questions. Let's look at a very simple program: dense matrix multiplication using shared memory. As I understand it, Nvidia provides one of its implementations in the CUDA C Programming Guide (here is the link): http://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared-memory

It is very simple, and I think everyone who is familiar with CUDA has already seen it. But let's measure this kernel's performance (GFLOPS). Using the nvprof utility we can collect metrics to count the floating-point operations, and using CUDA events we can measure the kernel's execution time.
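For reference, here is a condensed version of that kernel, following the structure of the guide's example (simplified to square N x N matrices where N is a multiple of BLOCK_SIZE):

```cuda
#define BLOCK_SIZE 16

// Tiled matrix multiplication C = A * B for square N x N matrices,
// N a multiple of BLOCK_SIZE. Condensed from the CUDA C Programming
// Guide example linked above.
__global__ void MatMulKernel(const float *A, const float *B, float *C, int N)
{
    // Shared-memory tiles of A and B, reused by all threads in the block
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    float Cvalue = 0.0f;

    // Walk over the tiles of A and B needed to compute C[row][col]
    for (int m = 0; m < N / BLOCK_SIZE; ++m) {
        // Coalesced loads from global memory into shared memory
        As[threadIdx.y][threadIdx.x] = A[row * N + m * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * BLOCK_SIZE + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply the two tiles together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[threadIdx.y][e] * Bs[e][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = Cvalue;
}
```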

So, for square matrix multiplication (2048x2048 float elements in each matrix), we get (1.7180e+10) / (0.054 * 10^9) = 318 GFLOPS.
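For completeness, this is roughly how I time the kernel with CUDA events; a sketch, assuming the MatMulKernel above and device buffers dA, dB, dC that are already allocated and filled:

```cuda
#include <cstdio>

// Timing harness sketch: launches the kernel once and derives GFLOPS.
void timeKernel(const float *dA, const float *dB, float *dC, int N)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(N / BLOCK_SIZE, N / BLOCK_SIZE);

    cudaEventRecord(start);
    MatMulKernel<<<grid, block>>>(dA, dB, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    // 2*N^3 flops for an N x N matmul; for N = 2048 this is
    // 2 * 2048^3 = 1.7180e+10, matching the nvprof count above.
    double gflops = 2.0 * N * N * N / (ms * 1.0e6);
    printf("%.3f ms, %.1f GFLOPS\n", ms, gflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```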

Now it's important to say that I'm using a GeForce GTX Titan card with a peak single-precision performance of about 3.1 TFLOPS. So we have only reached about 1/10 of peak performance, even though we have already applied all the optimizations I know from my university CUDA course (shared memory, coalesced memory access, and so on). My first guess would be that this is a memory-bound problem, but as far as I know that is not right: cuBLAS's SGEMM function (if I'm right) reaches about 71% of peak performance. Of course I understand that it's very hard to reach cuBLAS performance, but why can't I reach even 1 TFLOPS?
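To check the memory-bound guess, here is my rough roofline estimate, assuming 16x16 tiles and taking ~288 GB/s as the GTX Titan's theoretical memory bandwidth (a spec-sheet value, not measured):

```latex
% With 16x16 tiles, each element of A and B is read from global memory
% N/16 times, so for N = 2048 and 4-byte floats:
\text{bytes} \approx 2 \cdot \frac{N^3}{16} \cdot 4 = \frac{N^3}{2} \approx 4.3\,\text{GB},
\qquad \text{flops} = 2N^3 \approx 1.72 \times 10^{10}

% Arithmetic intensity and the resulting bandwidth ceiling:
I = \frac{2N^3}{N^3/2} = 4\ \text{flops/byte},
\qquad P_{\text{mem}} \approx 4 \times 288\,\text{GB/s} \approx 1.15\,\text{TFLOPS}
```

If that estimate is right, bandwidth alone would allow about 1.15 TFLOPS, well above my measured 318 GFLOPS, which seems to confirm that the kernel is not purely memory bound.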

So, the questions are:

1) Am I right in my reasoning?

2) What are the main reasons why I can't reach even half of peak performance?

3) What other optimizations can I use? (Anything you know would be very useful here: articles, suggestions, and so on.)

Thank you for your attention!

  • 2
Are you familiar with [Google Scholar](http://scholar.google.com/)? Simple queries using the keywords "GPU" "CUDA" "GEMM" yield a nice collection of articles dealing with strategies for high-performance implementations. The remaining part of the performance, beyond algorithms, comes from finely-tuned hand-crafted assembly-language implementations (this pretty much applies to all computing platforms). – njuffa Dec 05 '14 at 17:38
  • 2
Please show your code. As it stands your question is, in my opinion, too broad to get a definitive answer. There are too many possible places to suggest improvements without seeing your implementation. – user703016 Dec 05 '14 at 17:39
About the code: I was talking about improving the performance of the Nvidia code, and I have provided the link. Of course I have my own implementation (1.4 times faster), but I decided not to post it here, because that is not the level of acceleration I am seeking, and I also wanted to avoid a discussion of possible errors in my own code. So let's assume that we want to accelerate the Nvidia code. – Ilya Afanasiev Dec 05 '14 at 18:35
Thank you, I didn't know about Google Scholar; that's great. – Ilya Afanasiev Dec 05 '14 at 18:41
Are you sure that assembly can improve a program by that much? As I understand it, we would have to jump from 1/10 of peak performance to 7/10 of peak performance, which seems unrealistic to me. Moreover, CUDA is C-style code, which is low-level and quite close to assembly. Or there is something I don't understand, so sorry for my silly questions. – Ilya Afanasiev Dec 05 '14 at 19:02
@Ilya Afanasiev I specifically referenced "the remaining part" after optimization strategies at the HLL level are exhausted. In my experience the performance advantage of a best-in-class hand-coded assembly-language solution over a best-in-class compiled HLL solution is typically in the 1.1x to 1.25x range for compute-bound code. – njuffa Dec 05 '14 at 19:32

2 Answers

2

The code you referred to is just a simple example for explanation; it is not practically usable, because it does not consider other optimization factors. In my experience, optimizing starting from this kind of example is not very effective.

Of course, you can't see the source code of cuBLAS, but there are a few open-source projects, including MAGMA, with practical implementations of matrix multiplication. The MAGMABLAS folder in the MAGMA source tree contains its implementation of BLAS, and it was helpful for me to learn how matrix multiplication can be implemented in practice.

Tae-Sung Shin
-1

I'm not sure if this applies here, but bank conflicts can decrease performance when using shared memory; a minimal illustration follows. There's a good explanation of this here: What is a bank conflict? (Doing Cuda/OpenCL programming)
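For illustration, here is a hypothetical sketch (the classic shared-memory tile transpose, not the kernel from the question) showing the conflict pattern and the usual one-element padding fix:

```cuda
// Hypothetical sketch of a shared-memory bank conflict and its fix;
// assumes n is a multiple of 32 and a 32x32 thread block.
__global__ void transpose_tile(const float *in, float *out, int n)
{
    // 32 banks, 4-byte words: with a 32-float row, element [r][c] lands
    // in bank c, so a warp reading down a column (fixed c, varying r)
    // hits the same bank 32 times -- a 32-way conflict:
    //   __shared__ float tile[32][32];
    //
    // Padding each row by one float shifts row r by one bank, so the
    // same column read now touches 32 different banks:
    __shared__ float tile[32][32 + 1];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // row-wise: conflict-free
    __syncthreads();

    int tx = blockIdx.y * 32 + threadIdx.x;
    int ty = blockIdx.x * 32 + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // column-wise read: needs padding
}
```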

Community