For instance,

    int *p;
    cudaMalloc(&p, sizeof(int));

takes around 20 seconds, and my process typically gains 650 MB+ (though always a slightly different amount) in memory usage in Task Manager. GPU-Z also indicates an increase of 200 MB+ in dedicated memory usage on my GPU.
- Only happens with the first call to `cudaMalloc`
- Does not matter if I call other CUDA functions before it, like `cudaGetDevice`
- Does not happen in some other CUDA projects
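To show what I mean by "only the first call", here is a minimal sketch that times the first and second `cudaMalloc` on the host (this is an illustrative repro I put together, not code from the affected project):

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

int main()
{
    int *p = 0, *q = 0;

    clock_t t0 = clock();
    cudaMalloc(&p, sizeof(int));   // first call: pays the one-time initialization cost
    clock_t t1 = clock();
    cudaMalloc(&q, sizeof(int));   // second call: near-instant by comparison
    clock_t t2 = clock();

    printf("first  cudaMalloc: %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("second cudaMalloc: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);

    cudaFree(p);
    cudaFree(q);
    return 0;
}
```

In my case the first timing is on the order of 20 seconds while the second is negligible.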
I am using:
- Thrust, CUBLAS, cuRAND libraries
- MSVC 2010 with NVCC
- Nsight 3.0
- CUDA 5.0
Why does this happen? What can be done?
Update:

As mentioned in the comments below, this appears to stem from initialization (calling `cudaFree(0)` has the same effect). However, as to why it's so slow, perhaps it has something to do with the runtime errors: the following exception occurs a good 30 times as the initialization line is hit:
First-chance exception at 0x74f0b727 in ...: Microsoft C++ exception: cudaError_enum at memory location 0x003ff9c4..
etc...
This still happens even when I'm not allocating anything, such as with a solitary call to `cudaFree(0);` - no idea why...
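If the cost really is one-time lazy initialization, one workaround I'm considering is paying it explicitly at startup so the stall happens at a predictable point rather than on the first "real" allocation (a sketch, assuming `cudaFree(0)` triggers the same initialization as the first `cudaMalloc`; `warmUpCuda` is a name I made up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Call once at the start of main() so the long initialization hit
// happens up front instead of on the first real allocation.
void warmUpCuda()
{
    cudaError_t err = cudaFree(0);  // forces lazy context creation
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
}
```

This doesn't explain the 20-second delay or the repeated exceptions, of course; it only moves them.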