If I launch my kernel with a grid whose blocks have dimensions:
dim3 block_dims(16,16);
How are the threads of each block split into warps? Do the first two rows of such a block form one warp, or the first two columns, or is the assignment arbitrary?
Assume a GPU with compute capability 2.0.