I have written a piece of code in MATLAB (R2018a) that is a hybrid between standard MATLAB code and CUDA code, compiled and linked with mexcuda. The core loop in my code interpolates a matrix, say from size [n x m] to [N x M]. I have sped up this part using the GPU.
Since this interpolation is within a loop, and since the sizes of the matrices I interpolate (both before and after) are the same in each iteration of the loop, I want to speed up the application by preallocating an array of the output size on the GPU. So I want to do something like: zeros(N,M,'gpuArray')
once at the start, provide it as input to the mexFunction, and write the interpolated matrix into that array. This would save a significant amount of allocation time (roughly (N_iterations - 1) * allocation_time).
My problem now is: I cannot figure out if this is possible. Using a mexFunction() as the entry point, the only way I know to retrieve input arrays is using something along the lines of:
mxGPUArray const *in = mxGPUCreateFromMxArray(prhs[0]);
float const *dev_in = (float const *)(mxGPUGetDataReadOnly(in));
but, as the name suggests, this results in read-only permission. I cannot use mxGPUGetData(in) instead, because the mxGPUArray is const and one cannot initialise a non-const entity from it. Does anyone know a way around this issue that does not involve allocating the array inside the mexFunction?
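For context, the only writable accessor I am aware of in the mxGPU API is mxGPUCopyFromMxArray(), which returns a non-const mxGPUArray — but it does so by making a copy, i.e. it allocates on every call, which is exactly what I am trying to avoid. A sketch of what that looks like inside the mexFunction (prhs[1] being the preallocated gpuArray):

```c
// Sketch only: a writable device pointer, but obtained via a deep copy,
// so it allocates on every call -- defeating the purpose of preallocation.
mxGPUArray *out = mxGPUCopyFromMxArray(prhs[1]);   // copy -> non-const handle
float *dev_out = (float *)(mxGPUGetData(out));     // writable pointer is fine here
// ... launch the interpolation kernel, writing to dev_out ...
plhs[0] = mxGPUCreateMxArrayOnGPU(out);
mxGPUDestroyGPUArray(out);
```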
EDIT:
The code below shows two C code samples, where the first is an analogy for my current code, and the second is what I am striving for:
Current:
#include <stdio.h>
#include <stdlib.h>

int main(const int argc, const char *argv[]) {
    // n, m, N, M: matrix sizes, assumed known at this point
    // Allocate input matrix and fill from input arguments
    FILE *fPtr = fopen(argv[1], "rb");
    double *mat_in = malloc(n*m * sizeof(*mat_in));
    fread(mat_in, sizeof(*mat_in), n*m, fPtr);
    fclose(fPtr);
    double *mat_out;
    for (int it = 0; it < 1000; it++) {
        // Allocate output array and fill it
        mat_out = malloc(N*M * sizeof(*mat_out));
        interpolation_function(mat_in, mat_out);
        // Do stuff with mat_out
        free(mat_out);
    }
    // Free mat_in, do more stuff and/or exit program
}
Idea:
#include <stdio.h>
#include <stdlib.h>

int main(const int argc, const char *argv[]) {
    // Allocate input matrix and fill from input arguments
    FILE *fPtr = fopen(argv[1], "rb");
    double *mat_in = malloc(n*m * sizeof(*mat_in));
    fread(mat_in, sizeof(*mat_in), n*m, fPtr);
    fclose(fPtr);
    // Allocate output array once at the start:
    double *mat_out = malloc(N*M * sizeof(*mat_out));
    for (int it = 0; it < 1000; it++) {
        interpolation_function(mat_in, mat_out); // Fills mat_out
        // Do stuff with mat_out here
    }
    free(mat_out);
    // Free mat_in, do more stuff and/or exit program
}
The above two are (at least in my mind) an analogy for the following MATLAB-CUDA hybrid code:
Current (MATLAB); the mexcuda function must allocate memory for the interpolation of input(:,:,indx):
accumresult = zeros(N,M);
input = randn(100,100,1000);
for indx = 1:1000
input_slice = mexcuda_interpolation( input(:,:,indx) );
accumresult = accumresult + foo( input_slice, other_parameters);
end
Idea (MATLAB): the memory allocation is moved out of the mexcuda function (and thus out of the core loop); the mexcuda function only needs to retrieve a pointer to this (writable) array:
accumresult = zeros(N,M,'gpuArray');
placeholder = zeros(N,M,'gpuArray'); % Memory allocated on GPU once here
input = randn(100,100,1000);
for indx = 1:1000
accumresult = accumresult + foo( mexcuda_interpolation(input(:,:,indx)), placeholder, other_parameters);
%mexcuda_interpolation() somehow gets a pointer to the allocated memory which it can write to
end
Note that there is indeed room to parallelise this further: as stated, this is an intermediate step in parallelising the entire application.