
So I just started writing in CUDA, following the An Even Easier Introduction to CUDA guide. So far so good. Then I wanted to implement a neural network, which meant quite a few calls to the function cudaMallocManaged(). So, to make things more readable, I decided to put these calls in a separate function called allocateStuff() (see code below). When I then run the program using nvprof, it does not show the GPU time of layerInit() and instead gives the following warning:

Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.

However, when I put the code from allocateStuff() directly in main(), the warning does not appear and the GPU time for layerInit() is shown. So now my question is: what am I doing wrong in this function, or else why does it (apparently) overflow the buffer?

Code:

#include <cuda_profiler_api.h>
#include <iostream>
#include <vector>

__global__
void layerInit(const unsigned int firstNodes,
               const unsigned int secondNodes,
               const unsigned int resultNodes,
               float *firstLayer,
               float *secondLayer,
               float *resultLayer) {
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;
   for (unsigned int i = index; i < firstNodes; i += stride) {
      firstLayer[i] = 0.0f;
   }
   for (unsigned int i = index; i < secondNodes; i += stride) {
      secondLayer[i] = 0.0f;
   }
   for (unsigned int i = index; i < resultNodes; i += stride) {
      resultLayer[i] = 0.0f;
   }
}

void allocateStuff(const unsigned int firstNodes,
                   const unsigned int secondNodes,
                   const unsigned int resultNodes,
                   float *firstLayer,
                   float *secondLayer,
                   float *resultLayer,
                   std::vector<float*> &firstWeightLayer,
                   std::vector<float*> &secondWeightLayer) {
   cudaMallocManaged(&firstLayer,  firstNodes  * sizeof(float));
   cudaMallocManaged(&secondLayer, secondNodes * sizeof(float));
   cudaMallocManaged(&resultLayer, resultNodes * sizeof(float));

   for (auto& nodeLayer : firstWeightLayer) {
      cudaMallocManaged(&nodeLayer, secondNodes * sizeof(float));
   }
   for (auto& nodeLayer : secondWeightLayer) {
      cudaMallocManaged(&nodeLayer, resultNodes * sizeof(float));
   }
}

template<typename T, typename... Args>
void freeStuff(T *t) {
   cudaFree(t);
}

template<typename T, typename... Args>
void freeStuff(T *t, Args... args) {
   freeStuff(&t);
   freeStuff(args...);
}

void freeStuff(std::vector<float*> &vec) {
   for (auto& v : vec) {
      freeStuff(&v);
   }
}

int main () {
   unsigned int firstNodes = 5, secondNodes = 3, resultNodes = 1;
   float *firstLayer = new float[firstNodes];
   float *secondLayer = new float[secondNodes];
   float *resultLayer = new float[resultNodes];
   std::vector<float*> firstWeightLayer(firstNodes, new float[secondNodes]);
   std::vector<float*> secondWeightLayer(secondNodes, new float[resultNodes]);

   allocateStuff(firstNodes, secondNodes, resultNodes,
                 firstLayer, secondLayer, resultLayer,
                 firstWeightLayer,secondWeightLayer);

   layerInit<<<1,256>>>(firstNodes,
                        secondNodes,
                        resultNodes,
                        firstLayer,
                        secondLayer,
                        resultLayer);

   cudaDeviceSynchronize();
   freeStuff(firstLayer, secondLayer, resultLayer);
   freeStuff(firstWeightLayer);
   freeStuff(secondWeightLayer);

   cudaProfilerStop();
   return 0;
}

Output of nvprof ./executable with function allocateStuff():

==18608== NVPROF is profiling process 18608, command: ./executable
==18608== Profiling application: ./executable
==18608== Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==18608== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   96.20%  105.47ms        11  9.5884ms  5.7630us  105.39ms  cudaMallocManaged
      ...

Output of nvprof ./executable without said function:

==18328== NVPROF is profiling process 18328, command: ./executable
==18328== Profiling application: ./executable
==18328== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.2080us         1  2.2080us  2.2080us  2.2080us  layerInit(unsigned int, unsigned int, unsigned int, float*, float*, float*)
      API calls:   99.50%  114.01ms        11  10.365ms  4.9390us  113.95ms  cudaMallocManaged
      ...

Compiler call: nvcc -std=c++11 -g -o executable main.cu

Simon Klaver

1 Answer

  1. Any time you are having trouble with a CUDA code, I recommend proper CUDA error checking. I suggest you implement it and check the output before asking others for help. Even if you don't understand the error output, it will be useful to anyone trying to help you.

    If we add the following to the end of your code, without any other changes:

    cudaError_t err = cudaGetLastError();  // add
    if (err != cudaSuccess) std::cout << "CUDA error: " << cudaGetErrorString(err) << std::endl; // add
    cudaProfilerStop();
    return 0;
    

    we get the following output:

    CUDA error: an illegal memory access was encountered
    

    With your allocation-in-function version of the code, what is happening is that the CUDA kernel you have written is making illegal memory accesses.
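
    Beyond a one-off check, a small helper macro saves repetition. Here is a minimal sketch (the name checkCuda is my own choice, not part of the CUDA API; it assumes <iostream> and <cstdlib> are included):

    // wrap every CUDA runtime call; report the failing call's location and stop
    #define checkCuda(call)                                                    \
       do {                                                                    \
          cudaError_t err_ = (call);                                           \
          if (err_ != cudaSuccess) {                                           \
             std::cerr << "CUDA error: " << cudaGetErrorString(err_) << " at " \
                       << __FILE__ << ":" << __LINE__ << std::endl;            \
             std::exit(1);                                                     \
          }                                                                    \
       } while (0)

    // usage: checkCuda(cudaMallocManaged(&firstLayer, firstNodes * sizeof(float)));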

  2. The primary problem here is a C/C++ coding error. To pick one example: when you pass float *firstLayer to allocateStuff(), you are passing firstLayer by value. That means that any modification to the numerical value of firstLayer (i.e. the pointer value itself, which is what cudaMallocManaged changes) will not show up in the calling function, i.e. will not be reflected in the value of firstLayer observed in main. This really has nothing to do with CUDA. If you passed a bare pointer to a function and then allocated it using e.g. malloc(), that would be similarly broken.

    Since we have C++ in view here, we will fix this by passing these pointers by reference instead of by value.
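
    To see the difference in isolation, here is a minimal sketch with no CUDA involved:

    #include <cstdlib>

    void allocByValue(float *p, size_t n)  { p = (float*)malloc(n * sizeof(float)); } // only the local copy of p changes
    void allocByRef  (float *&p, size_t n) { p = (float*)malloc(n * sizeof(float)); } // the caller's pointer changes

    int main() {
       float *a = nullptr, *b = nullptr;
       allocByValue(a, 5); // a is still nullptr here (and the allocation leaked)
       allocByRef(b, 5);   // b now points at the allocation
       free(b);
       return 0;
    }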

  3. When creating managed allocations, it's not necessary to first allocate the pointer using new as you have shown here. Furthermore, although it's not the source of any problem here, doing so is on the path to creating memory leaks in your program, so you shouldn't do it.
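
    The pattern in question looks like this (a sketch of the sequence in your program):

    float *p = new float[n];                  // host allocation via new
    cudaMallocManaged(&p, n * sizeof(float)); // p is overwritten; the new[] block is now unreachable (leaked)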

  4. Not sure why you are using ampersands here:

    freeStuff(&v);
    

    and here:

    freeStuff(&t);
    

    As you strip off arguments to pass to cudaFree, you should pass them directly, not their addresses.

The following code has these issues addressed:

$ cat t1592.cu
#include <cuda_profiler_api.h>
#include <iostream>
#include <vector>

__global__
void layerInit(const unsigned int firstNodes,
               const unsigned int secondNodes,
               const unsigned int resultNodes,
               float *firstLayer,
               float *secondLayer,
               float *resultLayer) {
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;
   for (unsigned int i = index; i < firstNodes; i += stride) {
      firstLayer[i] = 0.0f;
   }
   for (unsigned int i = index; i < secondNodes; i += stride) {
      secondLayer[i] = 0.0f;
   }
   for (unsigned int i = index; i < resultNodes; i += stride) {
      resultLayer[i] = 0.0f;
   }
}

void allocateStuff(const unsigned int firstNodes,
                   const unsigned int secondNodes,
                   const unsigned int resultNodes,
                   float *&firstLayer,
                   float *&secondLayer,
                   float *&resultLayer,
                   std::vector<float*> &firstWeightLayer,
                   std::vector<float*> &secondWeightLayer) {
   cudaMallocManaged(&firstLayer,  firstNodes  * sizeof(float));
   cudaMallocManaged(&secondLayer, secondNodes * sizeof(float));
   cudaMallocManaged(&resultLayer, resultNodes * sizeof(float));

   for (auto& nodeLayer : firstWeightLayer) {
      cudaMallocManaged(&nodeLayer, secondNodes * sizeof(float));
   }
   for (auto& nodeLayer : secondWeightLayer) {
      cudaMallocManaged(&nodeLayer, resultNodes * sizeof(float));
   }
}

template<typename T, typename... Args>
void freeStuff(T *t) {
   cudaFree(t);
}

template<typename T, typename... Args>
void freeStuff(T *t, Args... args) {
   freeStuff(t);
   freeStuff(args...);
}

void freeStuff(std::vector<float*> &vec) {
   for (auto& v : vec) {
      freeStuff(v);
   }
}

int main () {
   unsigned int firstNodes = 5, secondNodes = 3, resultNodes = 1;
   float *firstLayer; // = new float[firstNodes];
   float *secondLayer; // = new float[secondNodes];
   float *resultLayer; // = new float[resultNodes];
   std::vector<float*> firstWeightLayer(firstNodes, new float[secondNodes]);
   std::vector<float*> secondWeightLayer(secondNodes, new float[resultNodes]);

   allocateStuff(firstNodes, secondNodes, resultNodes,
                 firstLayer, secondLayer, resultLayer,
                 firstWeightLayer,secondWeightLayer);

   layerInit<<<1,256>>>(firstNodes,
                        secondNodes,
                        resultNodes,
                        firstLayer,
                        secondLayer,
                        resultLayer);

   cudaDeviceSynchronize();
   freeStuff(firstLayer, secondLayer, resultLayer);
   freeStuff(firstWeightLayer);
   freeStuff(secondWeightLayer);
   cudaError_t err = cudaGetLastError();
   if (err != cudaSuccess) std::cout << "CUDA error: " << cudaGetErrorString(err) << std::endl;
   cudaProfilerStop();
   return 0;
}
$ nvcc -o t1592 t1592.cu
$ cuda-memcheck ./t1592
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvprof ./t1592
==23751== NVPROF is profiling process 23751, command: ./t1592
==23751== Profiling application: ./t1592
==23751== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  355.63us         1  355.63us  355.63us  355.63us  layerInit(unsigned int, unsigned int, unsigned int, float*, float*, float*)
      API calls:   96.34%  328.78ms        11  29.889ms  7.4380us  328.69ms  cudaMallocManaged
                    1.80%  6.1272ms       388  15.791us     360ns  1.7016ms  cuDeviceGetAttribute
                    1.46%  4.9900ms         4  1.2475ms  595.29us  3.0996ms  cuDeviceTotalMem
                    0.13%  444.60us         4  111.15us  97.400us  134.37us  cuDeviceGetName
                    0.10%  356.98us         1  356.98us  356.98us  356.98us  cudaDeviceSynchronize
                    0.10%  329.51us         1  329.51us  329.51us  329.51us  cudaLaunchKernel
                    0.06%  212.66us        11  19.332us  10.066us  74.953us  cudaFree
                    0.01%  27.695us         4  6.9230us  3.6950us  12.111us  cuDeviceGetPCIBusId
                    0.00%  8.7990us         8  1.0990us     453ns  1.7600us  cuDeviceGet
                    0.00%  6.2770us         3  2.0920us     368ns  3.8460us  cuDeviceGetCount
                    0.00%  2.6700us         4     667ns     480ns     840ns  cuDeviceGetUuid
                    0.00%     528ns         1     528ns     528ns     528ns  cudaGetLastError

==23751== Unified Memory profiling result:
Device "Tesla V100-PCIE-32GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       1         -         -         -           -  352.0640us  Gpu page fault groups
$

Notes:

  1. Before running any CUDA profiler, make sure your code is free of runtime errors reported by CUDA. The minimal error checking shown above, combined with a cuda-memcheck run, is a good way to do that.

  2. I've not really tried to determine whether there are any potential issues with firstWeightLayer or secondWeightLayer. They are not contributing any runtime errors, but depending on how you attempt to use them, you may run into trouble. Since there is no evidence of how you will use them, I'll leave it at that, apart from the construction detail sketched below.
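
     As a sketch of that construction detail (an observation, not a claimed runtime error in this program): std::vector<float*> v(n, new float[m]) calls new float[m] once and copies that single pointer into all n elements, and allocateStuff() then overwrites every element with its own managed allocation, so the one new[] block is leaked. Constructing with nullptr avoids the host allocation entirely:

     std::vector<float*> firstWeightLayer(firstNodes, nullptr);   // filled in by cudaMallocManaged
     std::vector<float*> secondWeightLayer(secondNodes, nullptr); // no host-side new[] needed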

Robert Crovella
  • Thanks a bunch for the hints you added! As for the references, I thought that because the layerInit() function did not require `*&`, the allocateStuff() function would also work with just `*`. The rest was just trying to fix it and then forgetting to undo it, I guess :P. – Simon Klaver Nov 17 '19 at 17:20