
I have written a piece of code in MATLAB (R2018a) that is a hybrid between standard MATLAB code and CUDA code, linked together by compiling with mexcuda. The core loop in my code interpolates a matrix, say from size [n x m] to [N x M]. I have sped up this part using the GPU. Since this interpolation happens inside a loop, and since the sizes of the matrices I interpolate (both before and after) are the same in every iteration, I want to speed up the application further by preallocating an array of the output size on the GPU. That is, I want to call zeros(N,M,'gpuArray') once at the start, provide it as input to the mexFunction, and write the interpolated matrix to this array. This would save quite a bit of allocation time (roughly (N_iterations - 1) * allocation_time).

My problem now is: I cannot figure out whether this is possible. Using a mexFunction() as the entry point, the only way I know to retrieve input arrays is something along the lines of:

mxGPUArray const *in  = mxGPUCreateFromMxArray(prhs[0]);
float const *dev_in  = (float const *)(mxGPUGetDataReadOnly(in));

but, as the name suggests, this gives read-only access. I cannot use mxGPUGetData(in), because the mxGPUArray is const and a non-const pointer cannot be initialised from it. Does anyone know a way around this issue that does not involve allocating the array inside the mexFunction? A sketch of the situation follows below.
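
For reference, a minimal sketch of the gateway situation (illustrative only, not my actual code; the kernel launch is omitted):

#include "mex.h"
#include "gpu/mxGPUArray.h"

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, mxArray const *prhs[]) {
    mxInitGPU();

    // The input gpuArray comes back as a pointer-to-const:
    mxGPUArray const *in = mxGPUCreateFromMxArray(prhs[0]);
    float const *dev_in = (float const *)(mxGPUGetDataReadOnly(in));
    (void)dev_in; // would be passed to the interpolation kernel

    // mxGPUGetData() expects a non-const mxGPUArray*, so a preallocated
    // gpuArray passed as prhs[1] and retrieved through mxGPUCreateFromMxArray
    // would be read-only here as well.

    mxGPUDestroyGPUArray(in);
}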

EDIT:

The code below shows two C code samples, where the first is an analogy for my current code, and the second is what I am striving for:

Current:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, const char *argv[]) {
    // n, m, N, M: fixed matrix sizes, assumed defined elsewhere
    // Allocate input matrix and fill from input arguments
    FILE *fPtr = fopen(argv[1], "rb");
    double *mat_in = malloc(n * m * sizeof(*mat_in));
    fread(mat_in, sizeof(*mat_in), n * m, fPtr);
    fclose(fPtr);

    double *mat_out;
    for (int it = 0; it < 1000; it++) {
        // Allocate output array and fill it
        mat_out = malloc(N * M * sizeof(*mat_out));
        interpolation_function(mat_in, mat_out);

        // Do stuff with mat_out
        free(mat_out);
    }
    // Free mat_in, do more stuff and/or exit program
    free(mat_in);
    return 0;
}

Idea:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, const char *argv[]) {
    // n, m, N, M: fixed matrix sizes, assumed defined elsewhere
    // Allocate input matrix and fill from input arguments
    FILE *fPtr = fopen(argv[1], "rb");
    double *mat_in = malloc(n * m * sizeof(*mat_in));
    fread(mat_in, sizeof(*mat_in), n * m, fPtr);
    fclose(fPtr);

    // Allocate output array once at the start:
    double *mat_out = malloc(N * M * sizeof(*mat_out));

    for (int it = 0; it < 1000; it++) {
        interpolation_function(mat_in, mat_out); // Fills mat_out
        // Do stuff with mat_out here
    }
    free(mat_out);
    // Free mat_in, do more stuff and/or exit program
    free(mat_in);
    return 0;
}

The two samples above are (at least in my mind) an analogy for the following MATLAB-CUDA hybrid code:

Current (MATLAB): the mexcuda function needs to allocate memory for the interpolation of input(:,:,indx)

accumresult = zeros(N,M);
input = randn(100,100,1000);
for indx = 1:1000
    input_slice = mexcuda_interpolation( input(:,:,indx) );
    accumresult = accumresult + foo( input_slice, other_parameters);
end

Idea: the memory allocation is moved out of the mexcuda function (and thus out of the core loop), and the mexcuda function only needs to retrieve the pointer to this (writable) array:

accumresult = zeros(N,M,'gpuArray');
placeholder = zeros(N,M,'gpuArray'); % Memory allocated on GPU once here
input = randn(100,100,1000);
for indx = 1:1000
    accumresult = accumresult + foo( mexcuda_interpolation(input(:,:,indx)), placeholder, other_parameters);
    %mexcuda_interpolation() somehow gets a pointer to the allocated memory which it can write to
end

Note that there is indeed a possibility to parallelise this further: as stated, I am in an intermediate step of parallelising the entire thing.

Floris
  • Wait, so you want to preallocate an array so you don't allocate it later? You need to do something with the output, no? You cannot just overwrite it. I am missing where you are saving time with this approach. – Ander Biguri Jun 07 '18 at 08:48
  • Additionally, a read-only GPU array, at least in CUDA, does not need time for preallocation. And if you are going to write to it, it does not need to be zeros, precisely because you are going to overwrite it. I am not sure how MATLAB handles that, but if it is smart (it should be), it will not initialize the output with zeros, just give you a pointer to memory. "Premature optimization is the root of all evil." Have you already seen this operation taking time in your code, or are you assuming it would? – Ander Biguri Jun 07 '18 at 08:54
  • Thanks for the reply, Ander. The point of the procedure I described is that I have a 3-D data volume that is too large to store in its entirety. But since I only need one 2-D slice of this cube at any given time, a workaround in my case is to store a coarse version of the 3-D cube instead, and interpolate the slice I need in the given iteration of the core loop. The output of the interpolation (the NxM matrix) is used later in the iteration, but can be discarded after it is used. So essentially, what I want to do is re-use the memory this NxM matrix occupied for the next NxM matrix. – Floris Jun 07 '18 at 10:52
  • That is a good answer to my first comment. What about my second comment? – Ander Biguri Jun 07 '18 at 10:58
  • I am not sure I understand you completely. You are very right that the initialisation to zeros is theoretically unnecessary, because a pointer to dynamically allocated memory of the same size (still holding garbage values) would do the trick. If you know a way to do that in MATLAB I would definitely be interested, but as far as I know, the conventional way to pre-allocate memory in MATLAB is zeros. To summarise: I want to pre-allocate this memory, call my mexcuda routine N times, and have the mexcuda routine overwrite the values in this array each time. – Floris Jun 07 '18 at 13:10
  • Continuation of the above: if I could wrap the whole thing in C code or a mexFunction, I would just allocate the array once and then reuse the memory N_iterations times. I hope to do that in the future, but I am far from that point in the development process. So for now, I will have to call the CUDA routine N_iterations times from MATLAB, and each of the calls will require memory to write to (preallocated in MATLAB if possible; otherwise, and this is what I am currently doing, allocated every time it is called). – Floris Jun 07 '18 at 13:11
  • Your logic is fundamentally flawed, either by bad code or bad understanding. You'd need to show the first to make sure it's not that. – Ander Biguri Jun 07 '18 at 13:46

1 Answer


For your mex code, use mxGPUCreateGPUArray instead of mxGPUCreateFromMxArray to allocate memory without initializing it.
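
For illustration, a minimal sketch of that allocation path (element type and sizes are stand-ins; the kernel launch is omitted):

#include "mex.h"
#include "gpu/mxGPUArray.h"

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, mxArray const *prhs[]) {
    mxInitGPU();

    // Allocate an [N x M] single-precision output on the GPU without
    // spending time initializing it to zeros:
    mwSize dims[2] = {512, 256}; // stand-ins for N and M
    mxGPUArray *out = mxGPUCreateGPUArray(2, dims, mxSINGLE_CLASS, mxREAL,
                                          MX_GPU_DO_NOT_INITIALIZE);
    float *dev_out = (float *)(mxGPUGetData(out)); // writable device pointer
    (void)dev_out; // would be passed to the interpolation kernel

    // Hand the result back to MATLAB as a gpuArray and release the wrapper:
    plhs[0] = mxGPUCreateMxArrayOnGPU(out);
    mxGPUDestroyGPUArray(out);
}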


About your MATLAB code: why are you preallocating? Understand the principles of what you are doing, because you will need them to work with GPUs.

In MATLAB, if you don't preallocate, then every time you append new data, MATLAB creates a new array with the new size under the hood and copies the data from the smaller old array into it. This is of course discouraged, as you are making unnecessary copies all the time.

In CUDA, this is not possible: dynamically growing arrays do not exist, not least because whatever you are computing does not happen serially in a for loop; it happens "all at the same time". Therefore it is fundamental that you know the size of the output before you launch an operation.
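
To see why, consider how a typical elementwise kernel is launched. In this toy sketch (not code from the question), every thread writes exactly one output element, so the grid itself is sized from the output dimensions:

#include <cuda_runtime.h>

// Each thread fills one element of the [N x M] output.
__global__ void fill_kernel(float *out, int N, int M) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < M)
        out[row + (size_t)col * N] = 1.0f; // column-major, as MATLAB stores arrays
}

int main(void) {
    const int N = 512, M = 256; // output size must be known up front
    float *dev_out;
    cudaMalloc(&dev_out, (size_t)N * M * sizeof(float));

    // The launch configuration is derived from the output size:
    dim3 block(16, 16);
    dim3 grid((M + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    fill_kernel<<<grid, block>>>(dev_out, N, M);
    cudaDeviceSynchronize();

    cudaFree(dev_out);
    return 0;
}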

So when you have GPU arrays A and B and apply f() to them, f needs to know the output size. If in MATLAB you write C=f(A,B), you do not need to preallocate C (in fact, in this example you would not need to without GPU computing either). MATLAB is smart enough to do it for you.

So either you need to see why, in the following code, preallocating C is a waste of time:

A=rand(N,M); % or A=rand(N,M,'gpuArray');
B=rand(N,M); % or B=rand(N,M,'gpuArray');
C=A+B;

Or, alternatively, your code looks something like this:

A=rand(N,M,'gpuArray');
B=rand(N,M,'gpuArray');
for ii=1:N
   for jj=1:M
       C(ii,jj)=A(ii,jj)+B(ii,jj);
   end
end

This fundamentally means that you are not getting any benefit from GPU computing, because you are serializing work over elements that could be processed in parallel.

Ander Biguri
  • Thanks for the elaborate reply, Ander. I ran the Visual Profiler on my CUDA code before and observed that allocation of the output array, i.e. `cudaMalloc(&ptr, size)`, took 2-3 ms. Since I know for sure that the matrices in each iteration will be the same size, and since I do not need the matrix anymore once I am beyond a certain point in the current iteration, I figured it would be beneficial to allocate only once and reuse the same memory in each iteration. Let me write an edit in the original post to clarify further. – Floris Jun 12 '18 at 09:01
  • @FlorisSA Hold that thought. The only reason my answer is elaborate is that it needs to tackle multiple hypothetical cases of what you could be doing. To really get any help, you need to show a minimal, complete and verifiable example, or some small realistic demo of what you are really doing and how/why. – Ander Biguri Jun 12 '18 at 09:03
  • Have a look at the code snippets I wrote and see if it makes sense to you, Ander. I would post the actual code if I were allowed to, but that is unfortunately not the case. – Floris Jun 12 '18 at 09:35
  • @FlorisSA If you use MATLAB, you accept that MATLAB does some things itself; if you want full control of memory, you need to write the code in CUDA. This makes sense, no? Otherwise why would anyone ever want to write code in CUDA? I still think that if MATLAB is smart enough, it will reuse `input_slice`, but as it is proprietary software, we cannot know. If you want to make sure, write CUDA. – Ander Biguri Jun 12 '18 at 09:51
  • @FlorisSA That said, with smart use of `mxGetPointer`, `mxGPUCreateGPUArray` and some other functions, you should be able to test this yourself. Make your `mex` function such that the output pointer is the same as the input pointer and write data there. – Ander Biguri Jun 12 '18 at 09:52
  • That makes sense, and that is indeed the end goal. Better to make a beeline for the full CUDA implementation then. Thanks for the help! – Floris Jun 12 '18 at 09:58