
I'm looking for a very bare-bones CUBLAS matrix multiplication example that multiplies M by N and places the result in P, using high-performance GPU operations, for the following setup:

const int Width = 500;   // square matrix dimension
float M[500][500], N[500][500], P[500][500];

for (int i = 0; i < Width; i++) {
    for (int j = 0; j < Width; j++) {
        M[i][j] = 500;
        N[i][j] = 500;
        P[i][j] = 0;
    }
}

So far, most code I'm finding to do any kind of matrix multiplication using CUBLAS is (seemingly?) overly complicated.

I am trying to design a basic lab where students can compare the performance of matrix multiplication on the GPU against matrix multiplication on the CPU, with the expectation that the GPU version will be faster.

  • Do you consider the simpleCublas example in the CUDA SDK to be "overly complicated"? – talonmies Oct 04 '11 at 08:04
  • Yes. I mean, if that's as simple as it gets, I guess we just have to deal with it. I was just hoping there would be some kind of code with an obvious CPU equivalent such that we could time both and compare the results. – Chris Redford Oct 04 '11 at 12:25
  • I'm the GTA for a 500-level Data Structures class. So we are already pumping them so full of details for programming various trees, heaps, and other data structures as well as relevant C++ and experimentation conventions that having them learn that many syntactic details for CUBLAS would really be out of the scope of relevant information for the class. – Chris Redford Oct 04 '11 at 12:32
  • The CUBLAS linear algebra calls themselves just follow the same syntax/API as the [standard BLAS](http://netlib.org/blas/), which has been the de facto linear algebra API and library since it was written in the 1980s. Using the GPU implies using a system with a non-uniform memory space, so it incurs some additional API overhead. If you consider either of those to be beyond the upper limit of what you are trying to teach, then I think you are out of luck. – talonmies Oct 04 '11 at 13:14
  • Okay. Thanks for the background info. I'll keep looking around. I may need to ask a more general question on SO. All I need is just SOME example, as simple as possible, that shows the GPU outperforming the CPU on some kind of algorithmic task, using CUDA. – Chris Redford Oct 04 '11 at 15:57

2 Answers


The SDK contains matrixMul, which illustrates the use of CUBLAS. For a simpler example, see section 1.3 of the CUBLAS manual.

The matrixMul sample also shows a custom kernel; this won't perform as well as CUBLAS, of course.
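For what it's worth, here is a rough sketch of what a minimal cuBLAS SGEMM call can look like, written against the v2 API (cublas_v2.h); it mirrors the 500x500 setup from the question and omits error checking, so treat it as a starting point rather than a polished implementation. One wrinkle worth pointing out to students: cuBLAS assumes column-major storage, so with row-major C arrays the operands are passed in swapped order to get P = M * N back in row-major layout.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define WIDTH 500

int main(void)
{
    size_t bytes = (size_t)WIDTH * WIDTH * sizeof(float);
    float *M = (float *)malloc(bytes);
    float *N = (float *)malloc(bytes);
    float *P = (float *)malloc(bytes);

    // Same initialization as the question, flattened to 1D.
    for (int i = 0; i < WIDTH * WIDTH; i++) {
        M[i] = 500.0f;
        N[i] = 500.0f;
        P[i] = 0.0f;
    }

    // Allocate device copies and upload the inputs.
    float *dM, *dN, *dP;
    cudaMalloc((void **)&dM, bytes);
    cudaMalloc((void **)&dN, bytes);
    cudaMalloc((void **)&dP, bytes);
    cudaMemcpy(dM, M, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, N, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // P = 1.0 * M * N + 0.0 * P. Passing (dN, dM) instead of (dM, dN)
    // makes the column-major result cuBLAS writes come out as row-major P.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                WIDTH, WIDTH, WIDTH,
                &alpha, dN, WIDTH, dM, WIDTH,
                &beta, dP, WIDTH);

    cudaMemcpy(P, dP, bytes, cudaMemcpyDeviceToHost);
    printf("P[0][0] = %f\n", P[0]);

    cublasDestroy(handle);
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
    free(M); free(N); free(P);
    return 0;
}

The only cuBLAS-specific parts are the handle and the single cublasSgemm call; everything else is the usual CUDA allocate/copy/free boilerplate, which is the extra API overhead mentioned in the comments above.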


CUBLAS is not necessary to show the GPU outperforming the CPU, though CUBLAS would probably widen the gap even further. Many straightforward CUDA implementations (including matrix multiplication) can outperform the CPU given a large enough data set, as explained and demonstrated here:

Simplest Possible Example to Show GPU Outperform CPU Using CUDA
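To make the comparison concrete, here is the kind of sketch I had in mind: a naive CUDA kernel side by side with its obvious CPU triple-loop equivalent, using the same 500x500 setup as the question. Timing code (CUDA events on the GPU side, clock() or similar on the CPU side) and error checking are left out to keep it short.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define WIDTH 500

// One thread per output element: P[row][col] = dot(row of M, column of N).
__global__ void matMulKernel(const float *M, const float *N, float *P, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; k++)
            sum += M[row * width + k] * N[k * width + col];
        P[row * width + col] = sum;
    }
}

// The obvious CPU equivalent, for timing against the kernel.
void matMulCPU(const float *M, const float *N, float *P, int width)
{
    for (int i = 0; i < width; i++)
        for (int j = 0; j < width; j++) {
            float sum = 0.0f;
            for (int k = 0; k < width; k++)
                sum += M[i * width + k] * N[k * width + j];
            P[i * width + j] = sum;
        }
}

int main(void)
{
    size_t bytes = (size_t)WIDTH * WIDTH * sizeof(float);
    float *M = (float *)malloc(bytes);
    float *N = (float *)malloc(bytes);
    float *P = (float *)malloc(bytes);
    for (int i = 0; i < WIDTH * WIDTH; i++) { M[i] = 500.0f; N[i] = 500.0f; P[i] = 0.0f; }

    float *dM, *dN, *dP;
    cudaMalloc((void **)&dM, bytes);
    cudaMalloc((void **)&dN, bytes);
    cudaMalloc((void **)&dP, bytes);
    cudaMemcpy(dM, M, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, N, bytes, cudaMemcpyHostToDevice);

    // GPU version: time everything from here to the copy back.
    dim3 block(16, 16);
    dim3 grid((WIDTH + block.x - 1) / block.x, (WIDTH + block.y - 1) / block.y);
    matMulKernel<<<grid, block>>>(dM, dN, dP, WIDTH);
    cudaDeviceSynchronize();
    cudaMemcpy(P, dP, bytes, cudaMemcpyDeviceToHost);

    // CPU version: time this call and compare.
    matMulCPU(M, N, P, WIDTH);

    cudaFree(dM); cudaFree(dN); cudaFree(dP);
    free(M); free(N); free(P);
    return 0;
}

At 500x500 the gap may be modest once the host-device copies are included, so bumping the size up (e.g. 1000 or 2000) can make the difference much easier for students to see.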
