Using the driver api precludes the usage of the runtime api in the same application ([1]) . Unfortunately cublas, cufft, etc are all based on the runtime api. If one wants dynamic kernel definition as in cuModuleLoad and cublas at the same time, what are the options? I have these in mind, but maybe there are more:
A. Wait for compute capability 3.5 that's rumored to support peaceful coexistence of driver and runtime apis in the same application.
B. Compile the kernels to an .so file and dlopen it. Do they get unloaded on dlcose?
C. Attempt to use cuModuleLoad from the driver api, but everything else from the runtime api. No idea if there is any hope for this.
I'm not holding my breath, because jcuda or pycuda are in pretty much the same bind and they probably would have figured it out already.