I have been learning CUDA for a while, and I have run into the following problem.
Here is what I am doing:
Copy to GPU:
int *B;
// ...
int *dev_B;
// initialize B to 0
cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int), cudaMemcpyHostToDevice);
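As an aside, the runtime calls above return a `cudaError_t`; this is a minimal sketch of how one could check them (standard CUDA runtime API, same variable names as above):

```cuda
// Sketch: check CUDA runtime return codes so a failed allocation or copy
// does not go unnoticed and show up later as mysterious timing behavior.
cudaError_t err = cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
err = cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
```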
//...
// Execute on the GPU the following kernel, which is supposed to fill
// the dev_B matrix with integers
findNeiborElem <<< Nblocks, Nthreads >>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);
Copy back to CPU:
cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int),cudaMemcpyDeviceToHost);
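For reference, this is a sketch of how one could time the kernel and the device-to-host copy as two separate steps with `cudaEvent` timers (standard CUDA runtime API; the variable names are the ones from the snippets above):

```cuda
// Sketch: time the kernel and the device-to-host copy separately.
// cudaEventSynchronize blocks until all work recorded so far has finished,
// so the two elapsed times cannot bleed into each other.
cudaEvent_t start, afterKernel, afterCopy;
cudaEventCreate(&start);
cudaEventCreate(&afterKernel);
cudaEventCreate(&afterCopy);

cudaEventRecord(start);
findNeiborElem<<<Nblocks, Nthreads>>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);
cudaEventRecord(afterKernel);
cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord(afterCopy);
cudaEventSynchronize(afterCopy);

float kernelMs = 0.0f, copyMs = 0.0f;
cudaEventElapsedTime(&kernelMs, start, afterKernel);
cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);
printf("kernel: %.3f ms, copy: %.3f ms\n", kernelMs, copyMs);
```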
Copying array B to dev_B takes only a fraction of a second, but copying dev_B back to B takes forever.
The findNeiborElem function runs a loop in each thread; it looks like this:
__global__ void findNeiborElem(int *dev_B, int *dev_MSH, int *dev_Nel, int *dev_Npel, int *dev_Nface, int *dev_FC){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < dev_Nel[0]){
        for (int j = 1; j <= dev_Nel[0]; j++){
            // do some calculations
            dev_B[ind(tid, 1, dev_Nel[0])] = j;
            break; // j in most cases does not go all the way to dev_Nel[0]
        }
        tid += blockDim.x * gridDim.x;
    }
}
What is very weird about it is that the time to copy dev_B back to B is proportional to the number of iterations of the j index.
For example, if Nel=5 the time is approximately 5 seconds, and when I increase it to Nel=20 the time is about 20 seconds.
I would expect the copy time to be independent of the number of inner iterations needed to assign the values of the matrix dev_B.
Also, I would expect copying the same matrix to and from the GPU to take roughly the same time.
Do you have any idea what is wrong?