I get confused reading about how to choose proper values for the number of threads and blocks in CUDA programming. After reading several guides and many tips, I still have not found the answer I am looking for. My GPU: Nvidia GT645M, compute capability: 3.0
What I know so far:
The maximum number of threads per block is 1024 in my case (for example 32 x 32)
The maximum number of blocks in a grid is 2**31 - 1 = 2147483647 blocks in my case
Multi processor count = 2
The number of blocks depends on the input data = (number of input elements)/(number of threads per block), rounded up
For input data like [1, 2, ..., 10] + [1, 2, ..., 10] I need 10 threads and 1 block.
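To show how I compute the block count, here is a small plain-Python sketch (no GPU involved). It is only my own illustration: the helper name `blocks_needed` is made up and is not part of any CUDA API, and I am assuming the division has to round up so that a partially filled last block is still counted.

```python
def blocks_needed(n_items, threads_per_block):
    """Number of blocks needed so that every item gets its own thread.

    Integer ceiling division: a partially filled last block still counts.
    """
    return (n_items + threads_per_block - 1) // threads_per_block


print(blocks_needed(10, 10))      # 1 block of 10 threads
print(blocks_needed(10, 1024))    # 1 block, most of its threads unused
print(blocks_needed(2500, 1024))  # 3 blocks, the last one only partly used
```

So for my 10-element vectors, any block size of 10 threads or more gives exactly 1 block.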
My computing problem:
For example, I have input data like this:
import numpy as np
n = 10
x = np.arange(n).astype(np.float32)  # [0, 1, ..., 9]
y = x + 1  # [1, 2, ..., 10]
I want to apply element-wise operations such as '+', '-', '*' to these vectors.
Question 1:
My understanding:
GPU calculations in CUDA work like this:
each value in the NumPy array is handled by one thread of a CUDA block.
I mean, for
(x = [0, 1, ..., 9]) + (y = [1, 2, ..., 10]), x[0] + y[0] is computed in block(0,0), thread(0,0),
then x[1] + y[1] in block(0,0), thread(1,0), and so on.
Is that correct?
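A minimal sketch of the mapping I have in mind, simulated in plain Python with NumPy (no GPU involved). The function name `simulate_add` is made up for illustration, and the loops run one after another here, whereas I assume the real threads would run in parallel. The element index is taken as block index times threads per block plus thread index.

```python
import numpy as np


def simulate_add(x, y, threads_per_block, num_blocks):
    """Simulate one-thread-per-element addition on the CPU."""
    out = np.zeros_like(x)
    log = []  # (block, thread, element index) for every element handled
    for block in range(num_blocks):
        for thread in range(threads_per_block):
            # Element index this (block, thread) pair is responsible for
            i = block * threads_per_block + thread
            if i < x.size:  # threads past the end of the array do nothing
                out[i] = x[i] + y[i]
                log.append((block, thread, i))
    return out, log


n = 10
x = np.arange(n).astype(np.float32)
y = x + 1
out, log = simulate_add(x, y, threads_per_block=10, num_blocks=1)
print(out)      # element-wise x + y
print(log[:2])  # [(0, 0, 0), (0, 1, 1)]
```

Here block 0, thread 0 handles x[0] + y[0] and block 0, thread 1 handles x[1] + y[1], which is the picture I describe above.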
Question 2:
Let's say:
thread count = 5
block count = 1
Will all threads in this one block then run 2 times to compute x + y?
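To make this question concrete, here is a plain-Python sketch (no GPU involved) of the two behaviours I can imagine for 5 threads and 10 elements. Both function names are made up for illustration, and I do not know which of the two, if either, matches what CUDA actually does.

```python
import numpy as np


def one_element_per_thread(x, y, total_threads):
    """Each simulated thread handles at most one element and then stops."""
    out = np.zeros_like(x)
    for tid in range(total_threads):
        if tid < x.size:
            out[tid] = x[tid] + y[tid]
    return out


def strided(x, y, total_threads):
    """Each simulated thread keeps going, stepping by the thread count."""
    out = np.zeros_like(x)
    for tid in range(total_threads):
        # Thread tid handles elements tid, tid + total_threads, ...
        for i in range(tid, x.size, total_threads):
            out[i] = x[i] + y[i]
    return out


n = 10
x = np.arange(n).astype(np.float32)
y = x + 1
a = one_element_per_thread(x, y, total_threads=5)
b = strided(x, y, total_threads=5)
print(a)  # only the first 5 elements are filled, the rest stay 0
print(b)  # all 10 elements are filled, each thread handled 2 of them
```

In the first version half of the result is never computed; in the second each thread "runs 2 times", which is the case I am asking about.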
Question 3:
How many blocks can run simultaneously?
If you can explain the calculation on CUDA step by step with a simple vector example, that would be nice.
Thanks for all your help. Please don't give any links to CUDA guides, I get confused reading them. Please give simple examples :)