I want to compute a weighted sum of two matrices using gpuArray to make it fast. For example, my CPU code is given below:

mat1 = rand(19,19);

mat2= rand(19,19);

Receptive_fieldsize = [4,3]; 

overlap = 1;

Output = GetweightedSum(mat1,mat2, Receptive_fieldsize,overlap); % this outputs a 6x6 matrix

The function body is:

function Output = GetweightedSum(mat1,mat2, RF,overlap)

gap = RF(1) - overlap;
output_size=[6,6];
Output = zeros(output_size); % preallocate the result
for u=1: output_size(1)
    for v=1: output_size(2)
        min_u = (u - 1) * gap + 1;
        max_u = (u - 1) * gap + RF(1);
        min_v = (v - 1) * gap + 1;
        max_v = (v - 1) * gap + RF(2);

        input1 = mat1(min_u:max_u,min_v:max_v);
        input2 = mat2(min_u:max_u,min_v:max_v);
        Output(u,v) = sum(sum(input1 .* input2));

    end
end

How can I convert it to a GPU function? Can I do it directly, or can I use a for loop in GPU code? I am totally new to GPU programming, so I don't know anything about it. I would be thankful if someone could guide me, or change the above code into a GPU function as a reference so that I can learn from it. Regards

khan

1 Answer

See if the code and the comments alongside it make sense to you -

function Output = GetweightedSumGPU(mat1,mat2, RF,overlap)

%// Create parameters
gap = RF(1) - overlap;
output_size=[6,6];
sz1 = output_size(1);
sz2 = output_size(2);

nrows = size(mat1,1); %// get number of rows in mat1

%// Copy data to GPU
gmat1 = gpuArray(mat1);
gmat2 = gpuArray(mat2);

start_row_ind = gpuArray([1:RF(1)]'); %//' starting row indices for each block
col_offset = gpuArray([0:RF(2)-1]*nrows); %// column offset for each block

%// Linear indices for each block
ind = bsxfun(@plus,start_row_ind,col_offset);

%// Linear indices along rows and columns respectively
ind_rows = bsxfun(@plus,ind(:),[0:sz1-1]*gap);
ind_rows_cols = bsxfun(@plus,ind_rows,permute([0:sz2-1]*gap*nrows,[1 3 2]));

%// Elementwise multiplication, summing and gathering back result to CPU
Output = gather(reshape(sum(gmat1(ind_rows_cols).*gmat2(ind_rows_cols),1),sz1,sz2));

return;
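
To see how the linear indices line up with the original loop, here is a CPU-only sketch that needs no GPU. It builds the same index arrays on plain matrices and compares the vectorized result against a loop reference. The names `ind_all`, `vec` and `ref` are illustrative and not part of the function above, and the sizes are the ones from the question (19x19 input, `RF = [4,3]`, 6x6 output).

```matlab
%// CPU-only check of the indexing scheme (illustrative sketch)
mat1 = rand(19,19);
mat2 = rand(19,19);
RF = [4,3];
overlap = 1;
gap = RF(1) - overlap;
sz1 = 6;
sz2 = 6;
nrows = size(mat1,1);

%// Linear indices of the first RF(1)-by-RF(2) block.
%// MATLAB is column-major, so moving one column right adds nrows.
ind = bsxfun(@plus, (1:RF(1))', (0:RF(2)-1)*nrows);

%// Shift the block down by gap rows for each output row ...
ind_rows = bsxfun(@plus, ind(:), (0:sz1-1)*gap);
%// ... and right by gap columns (gap*nrows in linear index) for each output column
ind_all = bsxfun(@plus, ind_rows, permute((0:sz2-1)*gap*nrows, [1 3 2]));

%// One column of ind_all per output element: multiply, sum, reshape
vec = reshape(sum(mat1(ind_all).*mat2(ind_all), 1), sz1, sz2);

%// Loop reference, same arithmetic as the CPU function in the question
ref = zeros(sz1, sz2);
for u = 1:sz1
    for v = 1:sz2
        r = (u-1)*gap + (1:RF(1));
        c = (v-1)*gap + (1:RF(2));
        ref(u,v) = sum(sum(mat1(r,c).*mat2(r,c)));
    end
end

%// Largest absolute difference; only rounding error is expected
disp(max(abs(vec(:) - ref(:))))
```

Because the column offset is always a multiple of the number of rows, the same construction should carry over to non-square inputs as long as `nrows` is taken from `size(mat1,1)`.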
Divakar
  • Let me try it, and I will get back to you in a while. – khan Sep 17 '14 at 13:01
  • @khan I tested it on my system with a GPU and it doesn't appear to beat the CPU code, but I guess this could be a learning experience for you. I think the problem is that you are not making the GPU do enough work. – Divakar Sep 17 '14 at 13:02
  • I agree. For me the result is something like this: gpuendtime = 0.0083, CPUtimetaken = 0.0578. But yes, I am just getting started with it. There will be more and more images; don't you think that if I have 1000 images this 0.05 will affect my end result? Or should I rearrange my code further? By the way, thanks a lot for it. – khan Sep 17 '14 at 13:17
  • @khan Is that image data in `mat1` and `mat2`? One of the big overheads is copying data to the GPU. If so, I think you need to do that copying just once? Also, try benchmarking by doing the copy before calling the GPU function and passing the gpuArray data into the GPU function. – Divakar Sep 17 '14 at 13:19
  • You mean directly sending gmat1 and gmat2 instead of mat1 and mat2? – khan Sep 17 '14 at 13:21
  • @khan Yes. That is, do `gmat1 = gpuArray(mat1); gmat2 = gpuArray(mat2);` and then call the function like this: `OutputGPU = GetweightedSumGPU(gmat1,gmat2, Receptive_fieldsize,overlap);`. Also, comment out the lines `gmat1 = gpuArray(mat1); gmat2 = gpuArray(mat2);` inside the GPU function, edit this line to `nrows = size(gmat1,1);`, and edit the function signature to `function Output = GetweightedSumGPU(gmat1,gmat2, RF,overlap)`. – Divakar Sep 17 '14 at 13:23
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/61416/discussion-between-khan-and-divakar). – khan Sep 17 '14 at 13:24
  • @khan There were some previous MATLAB+GPU related solutions that could be useful for learning purposes - http://stackoverflow.com/a/25470560/3293881 http://stackoverflow.com/a/25020990/3293881 http://stackoverflow.com/a/25162350/3293881 http://stackoverflow.com/a/25386429/3293881 http://stackoverflow.com/a/25288227/3293881 – Divakar Sep 17 '14 at 13:44
  • @khan BTW Could you run this and let me know the output which would be the runtimes with CPU and GPU codes respectively - `N = 999;mat1 = rand(N);mat2= rand(N); Receptive_fieldsize = [17,17]; overlap = 1; tic,OutputCPU = GetweightedSum(mat1,mat2, Receptive_fieldsize,overlap);toc gmat1 = gpuArray(mat1);gmat2 = gpuArray(mat2); tic,OutputGPU = GetweightedSumGPU(gmat1,gmat2, Receptive_fieldsize,overlap);toc` – Divakar Sep 17 '14 at 15:41
  • yeah why not, wait a minute – khan Sep 17 '14 at 16:11
  • Elapsed time is 0.089908 seconds. Elapsed time is 0.006398 seconds. This is the result…. – khan Sep 17 '14 at 16:21
  • I am trying to understand how you converted and used the matrix, but when I try to change my other code in the same way, I am unable to. I am also trying to read your other posts, but I can't work out how to do it. Should I post another question or should I add it here? – khan Sep 18 '14 at 15:09
  • @khan If you are working with other code, it would be better I think to post it as a new question. – Divakar Sep 18 '14 at 15:38
  • With due respect, the GPU code that you provided has some error: when I run both versions, with and without the GPU, they give different results. One point to note is that my input has a different number of rows and columns than 19x19, i.e. 64x48. Maybe this is the reason. I made some changes to the code for this, but even then the output is different. I tried the following changes: `ncols = size(gmat1,2);` `col_offset = gpuArray([0:RF(1)-1]*ncols); %// column offset for each block` `ind_rows_cols = bsxfun(@plus,ind_rows,permute([0:sz2-1]*gap*ncols,[1 3 2]));` – khan Sep 22 '14 at 12:57
  • @khan Use this to see the error - `error = max(abs(OutputCPU(:)-OutputGPU(:)))<1e-5 %// If error is 1, it means no error`. Basically, don't expect exactly matching results, mainly because the GPU code does everything in one go, whereas the CPU code works in iterations, so floating-point rounding can differ. – Divakar Sep 23 '14 at 13:33
  • @khan error is a binary number here because I am comparing against `1e-5` and so it can't be `2.0014`. Check again? Saying it again - `error = max(abs(OutputCPU(:)-OutputGPU(:)))<1e-5`. So `error = 1` means there is no error and `error = 0` means there is an error. – Divakar Sep 23 '14 at 14:31
  • @khan This is what I tried - http://pastebin.com/CyDymspE . Can you check if I am doing anything wrong there? – Divakar Sep 23 '14 at 14:37
  • Sorry for being late. To be more specific, what I did is this: `error = max(temp_unactivated1(:)-temp_unactivated(:)); if(error < 1e-5) %// If errror is 1, it means no error disp('errror is 1, it means no error'); end` but it does not display the message inside the if statement. – khan Sep 23 '14 at 15:02
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/61769/discussion-between-khan-and-divakar). – khan Sep 23 '14 at 15:09
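
The benchmarking and comparison advice from the comments can be put together in one short script. This is a minimal sketch, assuming the Parallel Computing Toolbox and a supported GPU are available. The timed workload is a simple elementwise product and sum used as a stand-in, not the `GetweightedSumGPU` function itself, and the names `f_cpu`, `f_gpu`, `tol` and `is_close` are hypothetical. It copies the data to the GPU once outside the timed region, times with `timeit` and `gputimeit` rather than bare `tic`/`toc`, and compares results on the absolute difference.

```matlab
%// Stand-in workload: elementwise product followed by a full sum
N = 999;
mat1 = rand(N);
mat2 = rand(N);

%// Copy to the GPU once, outside the timed region
gmat1 = gpuArray(mat1);
gmat2 = gpuArray(mat2);

f_cpu = @() sum(sum(mat1 .* mat2));
f_gpu = @() sum(sum(gmat1 .* gmat2));

%// timeit/gputimeit run the function several times;
%// gputimeit also waits for the GPU to finish before stopping the clock
t_cpu = timeit(f_cpu);
t_gpu = gputimeit(f_gpu);

%// Bring the GPU result back to the CPU before comparing
out_cpu = f_cpu();
out_gpu = gather(f_gpu());

%// Compare on the absolute difference, so that negative
%// differences are not mistaken for a match
tol = 1e-5;
is_close = max(abs(out_cpu(:) - out_gpu(:))) < tol;

fprintf('CPU: %.6f s, GPU: %.6f s, within tolerance: %d\n', t_cpu, t_gpu, is_close);
```

To time the actual functions from this thread, the two anonymous functions could be swapped for calls to `GetweightedSum` and the gpuArray-input version of `GetweightedSumGPU`.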