
I have a matrix and I only want to access its lower triangular part. I am trying to find a good thread indexing scheme, but so far I have not managed it. Any ideas? I need an index to loop over the lower triangular matrix. Say this is my matrix:

1 2 3 4
5 6 7 8
9 0 1 2
3 5 6 7

The index should cover:

1 
5 6
9 0 1
3 5 6 7

In this example, those are positions 0, 4, 5, 8, 9, 10, 12, 13, 14, 15 of a 1D array.

The CPU loop is:

for (i = 0; i < N; i++) {
    for (j = 0; j <= i; j++) {
        .......
    }
}

where N is the number of rows. I tried the following kernel:

__global__ void Kernel(int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if ((row < N) && (col <= row))
        printf("%d\n", row + col);
}

and then call it this way:

 dim3 Blocks(1,1);
 dim3 Threads(N,N);
 Kernel<<< Blocks, Threads>>>(N);

but it doesn't work at all. This is what I get:

0
1
2
2
3
4
Manolete
  • i think the condition should be `if ((row – sgarizvi Sep 11 '12 at 11:21
  • You are partially right but I think the whole idea is wrong – Manolete Sep 11 '12 at 11:25
  • the idea is correct. What is the number of threads you are launching? If the number of threads per block exceeds the device limit, the kernel launch would fail. – sgarizvi Sep 11 '12 at 11:28
  • @sgar91 Please see my last edit; that is exactly what I am now executing, and I am not getting close, so I must be doing something wrong here. Following the example of a 4x4 matrix, I am launching the kernel with N=4 threads in each dimension – Manolete Sep 11 '12 at 11:37
  • there must be something else, some very minor mistake, because the logic of the program is correct, but it is printing only the first 3 rows. – sgarizvi Sep 11 '12 at 11:49

2 Answers


You're launching a full N x N grid of threads and then disabling all those above the diagonal, so N(N-1)/2 of the N*N threads (close to 50% for large N) will do nothing, which is very inefficient.

The simple fix for your code is to compute the position in the 1D array as row * N + col instead of row + col:

__global__ void Kernel(int N)
{
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  int col = blockIdx.y * blockDim.y + threadIdx.y;
  // Keep only threads on or below the diagonal
  if((row < N) && (col<=row) )
    printf("%d\n", row * N + col);  // row-major position in the 1D array
}

Perhaps a more efficient, but more complex, solution would be to launch only as many threads as there are lower-triangular elements, N(N+1)/2, and convert each linear thread index into a (row, col) pair. Check out this answer for starting points...
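One possible way to do that conversion is sketched below in plain C++ so the arithmetic can be checked on the host. It is only an illustration, not necessarily what the linked answer proposes. The names `RowCol`, `lowerTriangleCount` and `lowerTriangleIndex` are hypothetical. The sketch relies on row r of the lower triangle starting at linear index r(r+1)/2, and it corrects the square-root result by at most a step or two in case of floating-point rounding.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type: a (row, col) pair inside the lower triangle.
struct RowCol {
    int row;
    int col;
};

// Number of elements in the lower triangle (diagonal included) of an N x N matrix.
inline long long lowerTriangleCount(int n) {
    return static_cast<long long>(n) * (n + 1) / 2;
}

// Map a linear index k in [0, N(N+1)/2) to (row, col) with col <= row.
inline RowCol lowerTriangleIndex(long long k) {
    assert(k >= 0);
    // Row r starts at k = r(r+1)/2, hence r = floor((sqrt(8k + 1) - 1) / 2).
    int row = static_cast<int>((std::sqrt(8.0 * static_cast<double>(k) + 1.0) - 1.0) / 2.0);
    // Guard against floating-point rounding in either direction.
    while (static_cast<long long>(row) * (row + 1) / 2 > k) --row;
    while (static_cast<long long>(row + 1) * (row + 2) / 2 <= k) ++row;
    int col = static_cast<int>(k - static_cast<long long>(row) * (row + 1) / 2);
    return {row, col};
}
```

In a kernel, k would presumably come from a 1D launch as `blockIdx.x * blockDim.x + threadIdx.x`, with threads where k >= N(N+1)/2 returning early, and the helper would need to be marked `__host__ __device__`. The position in the 1D array is then still row * N + col.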

Tom

The problem is that we are indexing a 1D array, so to map a (row, col) pair onto it we need to multiply the row index by the number of columns and then add the column index. Following the example:

__global__ void Kernel(int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if ((row < N) && (col <= row))
        printf("%d\n", row * N + col);
}
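As a quick host-side check of that formula, the following sketch (plain C++, with the hypothetical helper name `lowerTriangleFlatIndices`) runs the same two nested loops as the CPU version in the question and collects row * N + col for every element on or below the diagonal.

```cpp
#include <vector>

// Collect the row-major 1D positions of the lower triangle (diagonal included)
// of an n x n matrix, in the same order as the CPU loop from the question.
std::vector<int> lowerTriangleFlatIndices(int n) {
    std::vector<int> out;
    for (int row = 0; row < n; ++row) {
        for (int col = 0; col <= row; ++col) {
            out.push_back(row * n + col);  // row index times number of columns, plus column
        }
    }
    return out;
}
```

For n = 4 this yields 0, 4, 5, 8, 9, 10, 12, 13, 14, 15, the positions listed in the question.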
Manolete
  • @harrism: I think Manolete wrote this independently at more or less the same time as me. – Tom Sep 12 '12 at 08:17