Suppose I have a ROWS*COLS = 4*4 matrix
A = [[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]]
arranged in K1*K2 = 2*2 tiles.
I want to traverse the elements of A tile by tile, so that I get the sequence
B = [1,2,5,6,3,4,7,8,9,10,13,14,11,12,15,16].
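
To make the intended order concrete, here is a minimal host-side sketch of the traversal. It assumes A is stored row-major in a flat array and that ROWS and COLS are exact multiples of K1 and K2. The function name to_tile_major is hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies a row-major rows x cols matrix into
// tile-major order (tiles visited row by row, elements inside each
// k1 x k2 tile also visited row by row).
// Assumption: rows is a multiple of k1 and cols is a multiple of k2.
std::vector<int> to_tile_major(const std::vector<int>& a,
                               std::size_t rows, std::size_t cols,
                               std::size_t k1, std::size_t k2) {
    assert(a.size() == rows * cols);
    assert(rows % k1 == 0 && cols % k2 == 0);

    std::vector<int> b;
    b.reserve(rows * cols);
    for (std::size_t tr = 0; tr < rows; tr += k1) {        // top row of the tile
        for (std::size_t tc = 0; tc < cols; tc += k2) {    // left column of the tile
            for (std::size_t r = 0; r < k1; ++r) {         // row inside the tile
                for (std::size_t c = 0; c < k2; ++c) {     // column inside the tile
                    b.push_back(a[(tr + r) * cols + (tc + c)]);
                }
            }
        }
    }
    return b;
}
```

Calling to_tile_major on the flattened A with rows = cols = 4 and k1 = k2 = 2 yields the sequence B shown above.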
Why don't we store the matrix in this order in a 1D array in the first place?
Would that help a CUDA program do matrix multiplication faster?