
How can I create a CUDA context? The first CUDA call is slow, and I want to create the context before I launch my kernel.

Arkerone

2 Answers


The canonical way to force runtime API context establishment is to call cudaFree(0). If you have multiple devices, call cudaSetDevice() with the ID of the device you want to establish a context on, then cudaFree(0) to establish the context.

EDIT: Note that as of CUDA 5.0, it appears that the heuristics of context establishment are slightly different and cudaSetDevice() itself establishes a context on the device it is called on. So the explicit cudaFree(0) call is no longer necessary (although it won't hurt anything).
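
A minimal sketch of what that looks like at application startup (device 0 is just an assumed choice, and the error handling is illustrative):

```cpp
// Force CUDA runtime context creation eagerly, before any kernel launch.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Select the device the context should live on.
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // On older toolkits this call triggers the lazy context creation;
    // on newer ones cudaSetDevice() already did it, and the call is harmless.
    err = cudaFree(0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree(0) failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("CUDA context established; later kernel launches skip the init cost.\n");
    return 0;
}
```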

talonmies
  • As of CUDA 12.0, the `cudaSetDevice()` performs eager initialization. Looks like `cudaFree(0)` is no longer needed. – biubiuty Jan 29 '23 at 05:16

Using the runtime API: cudaDeviceSynchronize, cudaDeviceGetLimit, or anything that actually accesses the context should work.

I'm quite certain you're not using the driver API, as it doesn't do that sort of lazy initialization, but for others' benefit the driver call would be cuCtxCreate.
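
For completeness, a minimal driver API sketch along those lines (device 0 and the terse error handling are assumptions, not part of the original answer):

```cpp
// Explicit context creation with the CUDA driver API.
#include <cuda.h>
#include <cstdio>

int main()
{
    CUdevice dev;
    CUcontext ctx;

    if (cuInit(0) != CUDA_SUCCESS)           { std::fprintf(stderr, "cuInit failed\n");      return 1; }
    if (cuDeviceGet(&dev, 0) != CUDA_SUCCESS) { std::fprintf(stderr, "cuDeviceGet failed\n"); return 1; }

    // Explicitly create (and make current) a context on the chosen device.
    if (cuCtxCreate(&ctx, 0, dev) != CUDA_SUCCESS) { std::fprintf(stderr, "cuCtxCreate failed\n"); return 1; }

    // ... use the context ...

    cuCtxDestroy(ctx);
    return 0;
}
```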

ChrisV
  • I use the OpenCV library and the first call is slow. I can choose the device in my application, but I would like to initialize the CUDA context at application launch. I tried cudaDeviceSynchronize but it doesn't work. – Arkerone May 02 '12 at 14:20
  • Are you sure it's actually context creation in that case? That's pretty fast on most hardware. OpenCV might (guessing here) be doing a large memcpy, and a preinitialized context won't help there. – ChrisV May 02 '12 at 15:14
  • From the OpenCV FAQ: "That is because of initialization overheads. On first GPU function call Cuda Runtime API is initialized implicitly. Also some GPU code is compiled (Just In Time compilation) for your video card on the first usage. So for performance measure, it is necessary to do dummy function call and only then perform time tests. If it is critical for an application to run GPU code only once, it is possible to use a compilation cache which is persistent over multiple runs. Please read nvcc documentation for details (CUDA_DEVCODE_CACHE environment variable)." – Arkerone May 02 '12 at 15:19
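
A minimal sketch of that "dummy call" warm-up idea, assuming a trivial empty kernel and whichever device the application will actually use:

```cpp
// Warm up the GPU once at startup so runtime initialization and JIT
// compilation are paid here rather than during the first timed call.
#include <cuda_runtime.h>

__global__ void warmupKernel() { /* intentionally empty */ }

void warmUpGpu(int deviceId /* assumed: the device you will actually use */)
{
    cudaSetDevice(deviceId);      // establish the context on that device
    warmupKernel<<<1, 1>>>();     // trigger module load / JIT for this GPU
    cudaDeviceSynchronize();      // block so the cost is fully paid now
}
```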