I've read that dynamic parallelism is supported in newer versions of CUDA, and that I can call Thrust functions like thrust::exclusive_scan inside a kernel by passing the thrust::device execution policy.
__global__ void kernel(int* inarray, int n, int* result) {
    extern __shared__ int s[];
    int t = threadIdx.x;
    s[t] = inarray[t];
    __syncthreads();
    // exclusive_scan takes an iterator range [first, last), not a count
    thrust::exclusive_scan(thrust::device, s, s + n, result);
    __syncthreads();
}
int main() {
    // prep work
    kernel<<<1, n, n * sizeof(int)>>>(inarray, n, result);
}
What confuses me is:
- When calling a Thrust function inside a kernel, does each thread call the function once, so that every thread launches its own dynamic-parallelism child work over the data?
- If so, I only need one thread to call thrust, so I could just guard the call with an if on threadIdx. If not, how do the threads in a block communicate that the call to Thrust has already been made and that they should ignore it? (This seems a little imaginary, since there wouldn't be a systematic way to ensure that from user code.)

To summarize: what exactly is happening when I call Thrust functions with the thrust::device parameter inside a kernel?