When I run my application, the very first `cudaMalloc` call takes 40 seconds, which is due to the initialization of the GPU. When I build in debug mode this drops to 5 seconds, and when I run the same code on a Fermi device it takes far less than a second (not even worth measuring in my case).
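A minimal sketch of how such a delay can be measured, not my exact code. It assumes a C++11-capable toolchain, and the helper name `timedMalloc` and the 1 MiB buffer size are made up for illustration. The first runtime call creates the context, so it absorbs the initialization cost, while the second call shows the steady-state cost:

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: time one cudaMalloc call and return the elapsed seconds.
// The buffer is freed afterwards, outside the timed region.
static double timedMalloc(size_t bytes)
{
    void* ptr = nullptr;
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&ptr, bytes);
    auto stop = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    cudaFree(ptr);
    return std::chrono::duration<double>(stop - start).count();
}

int main()
{
    // First call: includes context creation and any other one-time setup.
    double first = timedMalloc(1 << 20);
    // Second call: the context already exists, so only the allocation is timed.
    double second = timedMalloc(1 << 20);
    std::printf("first cudaMalloc:  %.3f s\n", first);
    std::printf("second cudaMalloc: %.3f s\n", second);
    return 0;
}
```

A standalone program like this contains no kernels, so it may well not reproduce the 40 seconds on its own. In my case the numbers come from the full application.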
The funny thing is that if I compile specifically for the Kepler architecture, using the flag `sm_35` instead of `sm_20`, startup becomes fast again. Since I should not use any new `sm_35` features just yet, how can I compile for `sm_20` without this huge delay? I am also curious what is causing the delay. Is the machine code recompiled on the fly into `sm_35` code?
P.S. I run on Windows, and a colleague of mine encountered the same problem, probably also on Windows. The device is a Kepler, driver version 320.