
MATLAB provides functions for preallocating/initializing arrays with common values such as 0 or 1 (e.g. `zeros`, `ones`). However, if we want the array to hold some arbitrary double value, there are several ways to do it, and it is not obvious which one is preferable.
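For reference, the built-in preallocation functions cover only a handful of values; the gap this question is about can be illustrated as:

```matlab
% Built-in preallocation functions for common values:
A = zeros(3);   % 3x3 double array filled with 0
B = ones(3);    % 3x3 double array filled with 1
C = nan(3);     % 3x3 double array filled with NaN
D = inf(3);     % 3x3 double array filled with Inf
% ...but there is no analogous built-in, say fill(3, pi),
% for an arbitrary value - hence the methods compared below.
```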

This problem is not new: it was previously discussed in places such as this blog post and this answer. However, experience shows that software (specifically, MATLAB and its execution engine) and hardware change over time, so the best approach may differ between systems. Unfortunately, the aforementioned sources do not provide benchmarking code, which might be the ultimate (and timeless) way to answer this question.

I'm looking for a benchmark I could run that would tell me the fastest approach to use on my system, considering that I might be using both "regular" double arrays and gpuArray double arrays, of various sizes.

  • I'm pretty sure I've seen this before. Did you just copy the question and answer from another existing post? – Durkee Feb 05 '19 at 16:10
  • @Durkee Absolutely not - I wrote this myself just now. There are various benchmarking questions on different topics related to MATLAB. You might be confusing it with something else. I did make an effort to find a similar question before posting, and I couldn't find something exactly like it. Of course, if you could point me to a sufficiently similar post, I'd be happy to move my answer there (or delete it altogether, if it provides nothing new). – Dev-iL Feb 05 '19 at 16:12
  • Oh sorry, I just remembered the CPU version of the post. You might want to consider a version with single precision though. It's generally not a good idea to use double precision on GPUs. – Durkee Feb 05 '19 at 16:15
  • @Durkee got a link? Indeed, there are speed benefits to running GPU computations using `single` precision, but for my purposes it is irrelevant as I need the extra precision. Do you suspect that the benchmark will produce different results (in terms of the best method) if using `single` precision? If so - all the more reason to have a benchmarking code. – Dev-iL Feb 05 '19 at 16:22
  • @Dev-iL The memory allocation won't be different for single or double (well, it will be x2). The computation though, will be more than x2 slower in GPUs using doubles unless you are using the Tesla family of GPUs, as almost all GPU processors are 32bit. – Ander Biguri Feb 05 '19 at 16:53
  • Seems pretty similar to [this question/answer](https://stackoverflow.com/a/18217986/3978545)? Although there's a method under discussion there which you've not benchmarked. Also [obligatory UndocumentedMatlab post](https://undocumentedmatlab.com/blog/preallocation-performance) – Wolfie Feb 05 '19 at 17:13
  • @Wolfie I suspect you haven't thoroughly read my question. I say this because a) I linked to the same blog post; b) I'm interested in initializing nonzero arrays; c) the results in the linked Q&A are from an old MATLAB version (i.e. likely irrelevant to the present) and were timed using `tic toc` (which is known to be unreliable); and d) it has no mention of `gpuArrays`. However, I can add the link (and this explanation) into my question if you feel it would improve its raison d'être. – Dev-iL Feb 05 '19 at 18:46

1 Answer

function allocationBenchmark(arrSz)
if nargin < 1
  arrSz = 1000;
end

%% RAM
t = [];
disp('--------------- Allocations in RAM ---------------')
t(end+1) = timeit(@()v1(arrSz), 1);
t(end+1) = timeit(@()v2(arrSz), 1);
t(end+1) = timeit(@()v3(arrSz), 1);
t(end+1) = timeit(@()v4(arrSz), 1);
t(end+1) = timeit(@()v5(arrSz), 1);
t(end+1) = timeit(@()v6(arrSz), 1);
t(end+1) = timeit(@()v7(arrSz), 1);
t = 1E3 * t; % conversion to msec
disp(t); disp(" ");
[~,I] = min(t);
disp("Conclusion: method #" + I + " is the fastest on the CPU!"); disp(" ");

%% VRAM
if gpuDeviceCount == 0, return; end
t = [];
disp('--------------- Allocations in VRAM --------------')
t(end+1) = NaN; % impossible (?) to run v1 on the gpu
t(end+1) = gputimeit(@()v2gpu(arrSz), 1);
t(end+1) = gputimeit(@()v3gpu(arrSz), 1);
t(end+1) = gputimeit(@()v4gpu(arrSz), 1);
t(end+1) = gputimeit(@()v5gpu(arrSz), 1);
t(end+1) = gputimeit(@()v6gpu(arrSz), 1);
t(end+1) = gputimeit(@()v7gpu(arrSz), 1);
t = 1E3 * t; % conversion to msec
disp(t); disp(" ");
[~,I] = min(t);
disp("Conclusion: method #" + I + " is the fastest on the GPU!");

end

%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% RAM
function out = v1(M)
% Indexing on the undefined matrix with assignment:
out(1:M, 1:M) = pi;
end

function out = v2(M)
% Indexing on the target value using the `ones` function:
scalar = pi;
out = scalar(ones(M));
end

function out = v3(M)
% Using the `zeros` function with addition:
out = zeros(M, M) + pi;
end

function out = v4(M)
% Using the `repmat` function:
out = repmat(pi, [M, M]);
end

function out = v5(M)
% Using the ones function with multiplication:
out = ones(M) .* pi;
end

function out = v6(M)
% Default initialization with full assignment:
out = zeros(M);
out(:) = pi;
end

function out = v7(M)
% Using the `repelem` function:
out = repelem(pi,M,M);
end

%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% VRAM
function out = v2gpu(M)
% Indexing on the target value using `gpuArray.ones`:
scalar = gpuArray(pi);
out = scalar(gpuArray.ones(M));
end

function out = v3gpu(M)
% Using `gpuArray.zeros` with addition:
out = gpuArray.zeros(M, M) + gpuArray(pi);
end

function out = v4gpu(M)
% Using the `repmat` function:
out = repmat(gpuArray(pi), [M, M]);
end

function out = v5gpu(M)
% Using `gpuArray.ones` with multiplication:
out = gpuArray.ones(M) .* gpuArray(pi);
end

function out = v6gpu(M)
% Default initialization with full assignment:
out = gpuArray.zeros(M);
out(:) = gpuArray(pi);
end

function out = v7gpu(M)
% Using the `repelem` function:
out = repelem(gpuArray(pi),M,M);
end

Running the above (e.g. with an input of 5000) results in the following:

--------------- Allocations in RAM ---------------
  110.4832  328.1685   48.7895   47.9652  108.8930   93.0481   47.9037

Conclusion: method #7 is the fastest on the CPU!

--------------- Allocations in VRAM --------------
       NaN   37.0322   17.9096   14.2873   17.7377   16.1386   16.6330

Conclusion: method #4 is the fastest on the GPU!

... which tells us the best (or equivalent) method to use in every case.
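As raised in the comments, GPUs are typically much faster in `single` precision, so it may be worth extending the benchmark accordingly. A single-precision variant of, e.g., method #4 might look like this (an untested sketch; the `v4single`/`v4gpuSingle` names are hypothetical, chosen to match the naming scheme above):

```matlab
function out = v4single(M)
% Single-precision variant of v4 (`repmat`), on the CPU:
out = repmat(single(pi), [M, M]);
end

function out = v4gpuSingle(M)
% Single-precision variant of v4gpu (`repmat`), on the GPU:
out = repmat(gpuArray(single(pi)), [M, M]);
end
```

The other methods can be adapted the same way, e.g. by passing `'single'` to `zeros`/`ones` where applicable.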

  • If you have the image processing toolbox [padarray](https://www.mathworks.com/help/images/ref/padarray.html) is an alternative: `padarray(pi,[M,M]-1,pi,'post');` – rahnema1 Feb 05 '19 at 16:48
  • I suggested it so your list of methods becomes more complete. – rahnema1 Feb 05 '19 at 17:08
  • 1
    @rahnema1 I've just checked, so `padarray` calls `padarray_algo` which calls `mkconstarray` which is nothing but `repmat` - so this means that it is just a slower version of `v4` (due to all the extra steps like the input validation of `padarray`). I have benchmarked it locally and my suspicion was confirmed - it performs slightly worse than `v4`, by a small margin - about the same as `v3`. So since it's just a wrapper around one of the existing methods (and it additionally requires a toolbox), I think there's no reason to add it to the benchmark - would you agree? – Dev-iL Feb 05 '19 at 21:41