
(correctly and instructively answered, see below)

I'm beginning to experiment with MATLAB and GPU computing (NVIDIA GTX 660).

Now, I wrote this simple Monte Carlo algorithm to estimate π. The following is the CPU version:

function pig = mc1vecnocuda(n)
countr=0;
A=rand(n,2);              % n points drawn uniformly in the unit square
for i=1:n
    if norm(A(i,:))<1     % point falls inside the quarter unit circle
        countr=countr+1;
    end
end
pig=(countr/n)*4;
end

This executes in very little time on the CPU with 100000 points "thrown" at the quarter unit circle:

   >> tic; mc1vecnocuda(100000);toc;

      Elapsed time is 0.092473 seconds.

See, instead, what happens with the GPU-ized version of the algorithm:

   function pig = mc1veccuda(n)
   countr=0;
   gpucountr=gpuArray(countr);   % counter held on the GPU
   A=gpuArray.rand(n,2);         % n x 2 random matrix generated on the GPU
   parfor (i=1:n,1024)
       if norm(A(i,:))<1
           gpucountr=gpucountr+1;
       end
   end
   pig=(gpucountr/n)*4;
   end

Now, this takes a LONG time to be executed:

>> tic; mc1veccuda(100000);toc;
Elapsed time is 21.137954 seconds.

I don't understand why. I used a parfor loop with 1024 workers because, querying my NVIDIA card with gpuDevice, 1024 is the maximum number of simultaneous threads allowed on the GTX 660.

Can someone help me? Thanks.

Edit: this is the updated version that avoids IF:

function pig = mc2veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
parfor (i=1:n,1024)
    gpucountr = gpucountr+nnz(norm(A(i,:))<1);   % branchless count, no IF
end
pig=(gpucountr/n)*4;
end

And this is the code written following Bichoy's guidelines (the right code to achieve the result):

function pig = mc3veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
Asq = A.^2;                              % square each coordinate
Asqsum_big_column = Asq(:,1)+Asq(:,2);   % squared norm of each point
Anorms=Asqsum_big_column.^(1/2);         % norm of each point
gpucountr=gpucountr+nnz(Anorms<1);       % count points inside the circle
pig=(gpucountr/n)*4;
end
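For readers outside MATLAB, here is a rough NumPy translation of the same fully vectorized estimator (the function name `mc_pi_vectorized` is mine, not from this thread): square the coordinates, count with an `nnz`-style reduction, and avoid any per-point loop.

```python
import numpy as np

def mc_pi_vectorized(n, seed=0):
    """Vectorized Monte Carlo estimate of pi: draw n points uniformly in
    the unit square and count those inside the quarter unit circle."""
    rng = np.random.default_rng(seed)
    A = rng.random((n, 2))                              # n x 2 samples in [0, 1)
    inside = np.count_nonzero((A**2).sum(axis=1) < 1.0) # NumPy's analogue of nnz
    return 4.0 * inside / n

print(mc_pi_vectorized(1_000_000))   # a value close to pi
```

Note that comparing the squared norm against 1 gives the same count as comparing the norm itself, so the square root in mc3veccuda is not strictly needed.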

Please note the execution time with n = 10 million:

>> tic; mc3veccuda(10000000); toc;
Elapsed time is 0.131348 seconds.
>> tic; mc1vecnocuda(10000000); toc;
Elapsed time is 8.108907 seconds.

I didn't test my original CUDA version (with the for/parfor loop), since its execution would have required hours with n = 10000000.

Great Bichoy! ;)

MadHatter
  • I'm pretty sure that something is moved from device to host and back again during the execution of the for loop. Is there any way to see if (and when) a device <--> host transfer occurs? – MadHatter Apr 13 '13 at 14:14
  • This is probably not the best way to do it. The whole thing could be vectorized as: `pig = nnz(sum(X.^2,2)<=1) * 4 / size(X,1)` where `X=rand(N,2)` or similar array on the gpu – Amro Apr 13 '13 at 22:14
  • off-topic, but here is a nice animation of this simulation: http://pastebin.com/w0N46vJE :) Similar to: http://en.wikipedia.org/wiki/File:Pi_30K.gif – Amro Apr 13 '13 at 22:19
  • Your link is welcome! – MadHatter Apr 14 '13 at 10:29

3 Answers


I guess the problem is with parfor!

parfor is supposed to run on MATLAB workers, that is, on your host, not on the GPU! I guess what is actually happening is that you are starting 1024 threads on your host (not on your GPU) and each of them is trying to call the GPU. This results in the tremendous time your code is taking.

Try to rewrite your code to use matrix and array operations, not for-loops! That will show some speed-up. Also, remember that you need to give the GPU much more computation to do; otherwise, memory transfer will just dominate your code's run time.

Code:

This is the final code after including all corrections and suggestions from several people:

function pig = mc2veccuda(n)
  A=gpuArray.rand(n,2);          % an n x 2 random matrix on the GPU
  Asq = A.^2;                    % square each element
  Anormsq = Asq(:,1)+Asq(:,2);   % squared norm of each point
  gpucountr = nnz(Anormsq<1);    % count points with squared norm < 1
  pig=(gpucountr/n)*4;
end
Bichoy
  • You are great, and gave a serious lesson to me. I suppose you left the code a bit messy for educational purposes, see my original post edit for further details! Thank you so much; from your answer I started to learn how to vectorize code the right way ;) – MadHatter Apr 13 '13 at 23:50
  • I am really glad it helped you. This is the whole purpose of this forum. Sorry if my code lacks educational comments; I will try to edit it to make it clearer, I replied in a hurry. Thank you for your nice words. – Bichoy Apr 14 '13 at 00:00
  • BTW, feel free to suggest any edits to my post as you see fit to make it more clear. If you can't make edits just add them in comments here and I will adapt it. – Bichoy Apr 14 '13 at 00:06
  • I corrected my answer as you posted in your final question. Thanks again :) – Bichoy Apr 14 '13 at 00:21
  • As you noticed, other guys were unable to suggest the right path. This is not because they are bad programmers, but because we are all educated to think in terms of traditional loops. As you rightly suggested, parallel programming is a whole other thing, and one has to recast one's way of thinking ;) – MadHatter Apr 14 '13 at 10:28
  • I'm afraid row 5 of your code doesn't increment gpucountr: it will just switch between 0 and 1 until execution finishes... Right? – MadHatter Apr 14 '13 at 19:34
  • No, it will just get the count of non-zero elements. Since you don't have a for-loop, no need to increment/add. `nnz` will actually count the non-zero elements and assign it to `gpucountr` directly. Try it :) – Bichoy Apr 14 '13 at 19:37

There could be many reasons, such as:

  1. Movement of data between host & device
  2. Computation within each loop is very small
  3. Call to rand on GPU may not be parallel
  4. if condition within the loop can cause divergence
  5. Accumulation to a common variable may run in serial, with overhead

It is difficult to profile MATLAB+CUDA code. You should probably try native C++/CUDA and use Parallel Nsight to find the bottleneck.

Nishanth
  • Thanks for your reply. According to this post: http://stackoverflow.com/questions/9054949/matlab-if-statements-with-cuda?rq=1 you are right. Now, I modified my code, using the nnz function instead of IF, and R2013a supports nnz on gpuArray. I got the same horrible result, so the bottleneck has to be in the FOR loop. For reasons hardly understandable to me, it appears that data are moved back and forth on each iteration, contrary to my beliefs. – MadHatter Apr 13 '13 at 14:39
  • can you update your post with new numbers? I am curious to know. – Nishanth Apr 13 '13 at 14:42
  • the loop body is very small and you are accumulating into the same variable (which is difficult to parallelize). Check if there are any built-in reduction operations. – Nishanth Apr 13 '13 at 14:47
  • Sure: >> tic; mc2veccuda(100000); toc; Elapsed time is 19.631224 seconds. – MadHatter Apr 13 '13 at 14:47
  • Note I am accumulating in a variable residing in GPU memory, but I understand it is nevertheless difficult to parallelize. Are you proposing to accumulate in chunks and then sum them up? Please be more specific, and if possible provide code snippets. Thanks a lot! ;) – MadHatter Apr 13 '13 at 14:52
  • I think it's very difficult to get fine-grained control using MATLAB. There are many examples in the CUDA SDK for doing reduction operations. First perform summation within each thread. Then recursively sum across consecutive threads. Along with efficient use of registers and local/shared memory. – Nishanth Apr 13 '13 at 15:00
  • I'm quite green at C/CUDA programming, to be honest... However, many thanks again; I'll study and I'll try to do something with C-CUDA and the MEX interface... – MadHatter Apr 13 '13 at 15:10
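The chunk-then-combine scheme Nishanth describes in the comments above can be illustrated outside CUDA. The following is a plain NumPy sketch of the idea only (the names `tree_sum` and `mc_pi_chunked`, and the chunk size, are my own choices, not from the CUDA SDK): partial counts are accumulated per chunk, then combined with a pairwise reduction whose depth is logarithmic instead of a serial O(n) accumulation chain.

```python
import numpy as np

def tree_sum(values):
    """Pairwise (tree) reduction: each pass sums adjacent pairs, so the
    depth is O(log n) rather than one long serial accumulation chain.
    On a GPU, every pair in a pass could be summed by a separate thread."""
    vals = list(values)
    if not vals:
        return 0
    while len(vals) > 1:
        if len(vals) % 2:        # odd length: pad with the neutral element
            vals.append(0)
        vals = [vals[2*i] + vals[2*i + 1] for i in range(len(vals) // 2)]
    return vals[0]

def mc_pi_chunked(n, chunk=4096, seed=0):
    """Monte Carlo pi estimate that accumulates a partial count per chunk
    (one chunk playing the role of one thread block) and then combines the
    partials with a tree reduction, instead of incrementing a single
    shared counter once per sample."""
    rng = np.random.default_rng(seed)
    partials = []
    for start in range(0, n, chunk):
        m = min(chunk, n - start)
        A = rng.random((m, 2))
        partials.append(np.count_nonzero((A**2).sum(axis=1) < 1.0))
    return 4.0 * tree_sum(partials) / n
```

This runs on the CPU, of course; it only demonstrates the dependency structure that makes the reduction parallelizable.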

As Bichoy said, CUDA code should always be vectorized. In MATLAB, unless you're writing a CUDA kernel, the only large speedup you're getting is that the vectorized operations are dispatched to the GPU, which has thousands of (slow) cores. If you don't have large vectors and vectorized code, it won't help.


Another thing that hasn't been mentioned is that for highly parallel architectures like GPUs you want to use different random number generation algorithms than the "standard" ones. So, to add to Bichoy's answer, specifying the generator 'Threefry4x64' (64-bit) or 'Philox4x32-10' (32-bit and a lot faster! Super fast!) can lead to large speedups in CUDA code. MATLAB explains this here: http://www.mathworks.com/help/distcomp/examples/generating-random-numbers-on-a-gpu.html

Chris Rackauckas