0

I have a 3D GPU array A with dimensions K x M x N and an int vector v of length M and want to construct 2D GPU arrays of the form

X = [A(:,1,v(1)), A(:,2,v(2)),..., A(:,M,v(M))] (depending on v)

in the most time-efficient way. Since all these are GPU arrays, I was wondering if there is a faster way to accomplish this than pre-allocating X and using the obvious for loop. My code needs to invoke several millions of these instances, so this becomes quite the bottleneck. Typical oders of magnitude would be K = 350 000, 2<=M<=15, N<=2000, if that matters.

EDIT: Here is a minimal working version of the original bottleneck code I am trying to improve. Conversion to the 3D array A has been commented out. Adjust the array size parameters as you see fit.

% generate test data:
K = 4000; M = 2; % N = 100

A_cell = cell(1,M);
s = zeros(1,M,'uint16');
for m=1:M
    s(m) = m*50; % defines some widths for the matrices in the cells
    A_cell{m} = cast(randi([0 1],K,s(m),'gpuArray'),'logical');
end
N = max(s,[],2);

% % A_cell can be replaced by a 3D array A of dimensions K x M x N:
% A = true(K,M,N,'gpuArray');
% for m=1:M
%     A(:,m,1:s(m)) = permute(A_cell{m},[1 3 2]);
% end

% bottleneck code starts here and has M = 2 nested loops:
I_true = true(K,1,'gpuArray');
I_01 = false(K,1,'gpuArray');
I_02 = false(K,1,'gpuArray');

for j_1=1:s(1)
    for j_2=1:s(2)

        v = [j_1,j_2];

        I_tmp = I_true;

        for m=1:M
            I_tmp = I_tmp & A_cell{m}(:,v(m));
        end

        I_02 = I_02 | I_tmp;
    end

    I_01 = I_01 | I_02;
end

Out = gather(I_01);

% A_cell can be replaced by 3D array A above
M.G.
  • 293
  • 1
  • 13
  • 2
    Creating new matrices from indexing old ones is slow. In your case, the only vectorizable chunk seems to be the `m` part, which really, wont make much difference if you need to do millions of times, particularly if your `v` are out of order indices. The time consuming part is reading non-adjacent values. That said, its really hard to help optimize code that is hidden, so if you actually need help with some particular code, you will need to show that particular code. Describing it is definetyl not enough – Ander Biguri Nov 17 '19 at 11:30
  • @AnderBiguri Thanks for the input, you are making a good point! I have added a minimal working example of the original code I am trying to improve in case I have been thinking about it the wrong way all long. – M.G. Nov 17 '19 at 17:20
  • 1
    My biggest suggestion is removing cells. Those are non optimized in MATLAB – Ander Biguri Nov 17 '19 at 17:22
  • @AnderBiguri: True but challenging in this situation, because each element has a different size. – Daniel Nov 17 '19 at 17:30
  • @AnderBiguri Indeed. That is why my starting point of the question was the 3D array instead of the cell. I now included the cell portion for completeness sake. On the other hand, I also did some benchmarks and couldn't find any peformance differences between executing the code with cell of 2D gpuArrays and a 3D gpuArray. I guess cell performance in MATLAB has improved over time, I am currently using R2019b. Or 3D gpuArrays are slower. – M.G. Nov 17 '19 at 17:30

2 Answers2

3

MATLAB allows you to index multiple dimensions at once. This allows you to give a linear indexing vector h which indexes both the second and third dimension at the same time:

% Some example data
k=2;
m=3;
n=4;
v=[2,3,1];
A=rand(k,m,n);
X=[A(:,1,v(1)),A(:,2,v(2)),A(:,3,v(3))]
%solution
h=sub2ind([m,n],[1:m],v);
Y=A(:,h)

Further reading: Linear indexing, logical indexing, and all that

Daniel
  • 36,610
  • 3
  • 36
  • 69
0

Regarding the code I posted above, it turns out to be faster to use a 2D gpuAarray instead of a 3D gpuArray in place of the cell. This allows for a very straighforward selection of the columns and vectorization of the farthest inside loop. More precisely:

% generate test data:
K = 4000; M = 2;

A_cell = cell(1,M); % this is given externally
s = zeros(1,M,'uint16');
for m=1:M
    s(m) = m*50; % defines some widths for the matrices in the cells
    A_cell{m} = cast(randi([0 1],K,s(m)),'logical'); % cell2mat doesn't work with cells of gpuArrays
end

% conversion of A_cell into an appropriate 2D array is straightforward:
A_hor = gpuArray(cell2mat(A_cell)); % horizontal concatenation of the cells

% bottleneck code starts here and has M = 2 nested loops:
I_01 = false(K,1,'gpuArray');
I_02 = false(K,1,'gpuArray');

t = [1,s]; t = t(1:M); % vector of the starting indices of the old matrices inside A_hor

for j_1=1:s(1)
    for j_2=1:s(2)

        j = [j_1,j_2];

        k = t-1+j; % vector of the positions of the needed columns

        I_02 = I_02 | all(A_hor(:,k),2);
    end

    I_01 = I_01 | I_02;
end

Out = gather(I_01);
M.G.
  • 293
  • 1
  • 13