
I have a CUDA kernel which works fine when called from a single CPU thread. However, when the same kernel is called from multiple CPU threads (~100), most of the kernel launches seem not to execute at all, as the results come out all zeros. Can someone please guide me on how to resolve this problem?

In the current version of the kernel I am using a `cudaDeviceSynchronize()` after the kernel call. Will adding a synchronization call before `cudaMalloc()` and the kernel launch be of any help in this case?

There is another thing which needs some clarification: if two CPU threads execute the same `cudaMalloc()` command, will the later call overwrite the former in GPU memory, or will each thread get its own allocation?

Thanks in advance for your help

Genutek
  • Same kernel --> same array --> probably same elements --> undefined behaviour. Try the same kernel with a different name and different buffers. – huseyin tugrul buyukisik Feb 14 '14 at 13:34
  • Are you suggesting creating copies of the kernel with different names for each thread? – Genutek Feb 14 '14 at 13:43
  • If you can create a new .cu file programmatically from the initial .cu file, yes, you should try. Compile time can increase. I used that once for my raytracer's recursive function (fake recursivity). – huseyin tugrul buyukisik Feb 14 '14 at 13:44
  • Are you doing any [proper CUDA error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api)? – Robert Crovella Feb 14 '14 at 16:05
  • 2
    I don't think there's any problem calling the same kernel 100 times from 100 different threads. Here's a [simple example](http://pastebin.com/pUZ6Ufug) using pthreads. It's likely that if you do proper error checking, and review your data management carefully, you'll discover the issue. Or else post a simple example that fails. You should not have to create kernels with different names for each thread. You should not have two threads execute the same identical `cudaMalloc()` command. A pointer should be passed to `cudaMalloc` only once, before it is passed to `cudaFree()` – Robert Crovella Feb 15 '14 at 03:10
  • Are you using constant or texture memory? Those would be limited to only one instance of the class. Please specify your CC version. – TripleS Feb 15 '14 at 18:54
  • Yes, I am using texture memory in my kernel. I think this is the main cause then. I am using CUDA 5.5 on a GTX 690. Might upgrade to CUDA 6.0 now. – Genutek Feb 16 '14 at 21:21

1 Answer


Usually a single CPU thread is used for calling a CUDA kernel. However, since CUDA 4.0, multiple CPU threads can share a context. You can use `cuCtxSetCurrent` to bind the context to the calling thread before launching the kernel. More information about this API function can be found here.
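As a rough sketch of this approach (not compiled here; `ctx` and `per_thread_work` are made-up names, and error handling is abbreviated), each host thread makes the shared driver-API context current before issuing any CUDA work:

```c
#include <cuda.h>

/* Created once at startup, e.g.:
 *   cuInit(0);
 *   cuDeviceGet(&dev, 0);
 *   cuCtxCreate(&ctx, 0, dev);
 */
CUcontext ctx;

void per_thread_work(void)
{
    /* Bind the shared context to *this* CPU thread. */
    CUresult rc = cuCtxSetCurrent(ctx);
    if (rc != CUDA_SUCCESS) {
        /* handle/report the error */
        return;
    }

    /* From here on, allocations, kernel launches, and copies
     * issued by this thread all target `ctx`. */
}
```

With the runtime API on CUDA 4.0+, the primary context is already shared between host threads of a process, so this explicit step matters mainly when mixing in the driver API (or bindings such as JCuda).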

Another workaround for this is to create a GPU worker thread that holds the context, and have the other threads pass their CUDA requests to that thread.

Regarding your other question: without setting the context for the proper thread, I remember that `cudaMalloc` would not even execute (I work with JCuda, so the behavior may be a little different). But if the context is correctly set for the calling thread, the allocations will not overwrite each other.
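To make that concrete, here is a sketch (not compiled here; names are illustrative) of why two threads calling `cudaMalloc` do not clobber each other: each call returns a fresh, independent device allocation, so overwrites can only happen if the threads deliberately share one pointer or buffer:

```c
#include <cuda_runtime.h>

void per_thread_work(void)
{
    float *d_buf = NULL;                       /* local to this CPU thread */

    /* Each call returns a new, independent device allocation; a second
     * thread running this same code gets a *different* d_buf. */
    cudaMalloc((void **)&d_buf, 1024 * sizeof(float));

    /* ... launch kernel on d_buf, copy results back ... */

    cudaFree(d_buf);                           /* each thread frees its own buffer */
}
```

The exception, as noted in the comments, is per-module state such as texture references and `__constant__` memory, which exists once per module/context rather than once per thread, and so can indeed be stomped on by concurrent callers.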

Maghoumi
  • Thanks a lot for your reply. I would definitely use this in my future tasks, but for my current case, as [TripleS](http://stackoverflow.com/users/907166/triples) suggested, texture memory will have only one instance. I think the main issue is with the texture memory. If time permits, I will try replacing the texture with global memory just to check whether the problem still remains. – Genutek Feb 16 '14 at 21:31