I have an Nvidia GTX 780 GPU. According to the specification, it has compute capability 3.5, with a maximum of 16 resident blocks per multiprocessor and a maximum of 2048 threads per multiprocessor. So, in order to fully utilize each multiprocessor, I calculated:
threads per block = 2048/16 = 128
Is 128 the best number of threads per block to use when launching the kernel? For example:
CalcTemperatureFactor_Kernel<<<250,128,0,Stream>>>(ComputeParticleNum);
However, launching with either 128 or 256 threads per block made no difference to the execution time.
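
To make my calculation explicit, here is a minimal host-side sketch of the arithmetic. The function names are hypothetical, and the limits (2048 threads and 16 blocks per multiprocessor) are the spec values quoted above passed in by hand, not values queried from the device. It only considers these two limits and ignores register and shared memory usage, which could lower the real numbers.

```cpp
#include <algorithm>

// Threads per block at which the maximum number of resident blocks
// exactly fills one multiprocessor's thread capacity.
// Assumption: the two limits are supplied by the caller (e.g. 2048 and 16).
int threadsPerBlockForFullSM(int maxThreadsPerSM, int maxBlocksPerSM) {
    return maxThreadsPerSM / maxBlocksPerSM;
}

// Blocks of the given size that can be resident on one multiprocessor,
// limited by whichever of the thread limit or the block limit is hit first.
int residentBlocksPerSM(int maxThreadsPerSM, int maxBlocksPerSM, int threadsPerBlock) {
    return std::min(maxThreadsPerSM / threadsPerBlock, maxBlocksPerSM);
}

// Threads resident on one multiprocessor for the given block size,
// under the same two limits only.
int residentThreadsPerSM(int maxThreadsPerSM, int maxBlocksPerSM, int threadsPerBlock) {
    return residentBlocksPerSM(maxThreadsPerSM, maxBlocksPerSM, threadsPerBlock) * threadsPerBlock;
}
```

With these limits, 128 threads per block gives 16 resident blocks and 256 threads per block gives 8 resident blocks, so both reach 2048 resident threads per multiprocessor. A smaller block size such as 64 would hit the 16-block limit first and reach only 1024.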