This is mostly from the book "Computer Architecture: A Quantitative Approach."
The book states that groups of 32 threads are grouped and executed together in what's called the thread block, but shows an example with a function call that has 256 threads per thread block, and CUDA's documentation states that you can have a maximum of 512 threads per thread block.
The function call looks like this:
int nblocks = (n+255)/256
daxpy<<<nblocks,256>>>(n,2.0,x,y)
Could somebody please explain how thread blocks are structured?