
I'm writing a CUDA kernel in which the threads perform the following task:

For example, if I have the array [1, 2, 3, 4], then I want the answer [12, 13, 14, 23, 24, 34].

Suppose I have an array with n integers and two indices i and j.

A simple solution for that in C would be:

k = 0;
for (i = 0; i < n - 1; i++)
    for (j = i + 1; j < n; j++)
    {
        new_array[k] = array[i] * 10 + array[j];
        k++;
    }

In CUDA, I've tried my luck:

for(i = threadIdx.x + 1; i < n-1; i++ )
    new_array[i] = array[threadIdx.x] * 10 + array[i];

But I think this is not a totally correct or optimal way to do this. Can anyone suggest anything better?

  • The problem you describe above (pairwise concatenation of chars) seems to be totally different from that illustrated by the `for` loops. In the latter case, it seems that you are facing an [all-prefix-sum operation](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html) for the internal loop. – Vitality Feb 10 '14 at 14:04
  • Thank you @JackOLantern. I'll check that out. – yogeshwar_misal Feb 10 '14 at 14:39
  • This is a repeat of a question you've posted previously and now deleted. The pure C code you've posted [does not create the answer you are suggesting](http://pastebin.com/xFWMwZG5). @JackOLantern can you post your response as an answer, I would upvote. – Robert Crovella Feb 10 '14 at 15:05
  • Damn sorry about the above code. I've changed it a bit. @RobertCrovella will you try it again. Pleaseee.. – yogeshwar_misal Feb 10 '14 at 15:32
  • Why don't you provide a complete code, as I have, with all include files and everything necessary to compile and run a program, showing the actual results when you run it, as I have. Likewise you should do this for the CUDA code you've posted. This demonstrates that to some degree you understand the question you are asking, and have some basic knowledge needed. In fact, SO [expects this](http://stackoverflow.com/help/on-topic): "3.Questions asking for code must demonstrate a minimal understanding of the problem ..." – Robert Crovella Feb 10 '14 at 16:29
  • @RobertCrovella The OP has edited his question and changed his sample code, which does not appear anymore to be an all-prefix-sum operation, at least according to the definition I know ([Parallel Prefix Sum (Scan) with CUDA](http://beowulf.lcs.mit.edu/18.337-2008/lectslides/scan.pdf)). Furthermore, the current code is different from the one you have linked to. Finally, the inner `for` loop of that code practically does not operate and I believe that the code could be rewritten with a single, proper `for` loop. – Vitality Feb 10 '14 at 21:36
  • @RobertCrovella I have added an answer. – Vitality Feb 11 '14 at 21:00

1 Answer


I'm assuming that the code you want to port to CUDA is the following:

#include <stdio.h>
#define N 7

int main(){

    int array[N] = { 1, 2, 3, 4, 5, 6, 7};
    int new_array[(N-1)*N/2] = { 0 };      // one slot per pair (i, j) with i < j

    // Enumerate all pairs (i, j) with j > i and store array[i]*10 + array[j]
    int k = 0;
    for (int i = 0; i < N; i++)
        for(int j = i+1; j < N; j++)
        { 
            new_array[k] = array[i]*10 + array[j];
            k++;
        }

    for (int i = 0; i < (N-1)*N/2; i++) printf("new_array[%d] = %d\n", i, new_array[i]);

    return 0;
}
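For N = 7, this prints the 21 = 7*6/2 values 12, 13, ..., 17, 23, ..., 27, and so on up to 67, matching the pairwise result described in the question.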

You may wish to note that you can recast the nested loops above as

for (int i = 0; i < N; i++)
    for(int j = i+1; j < N; j++)
        new_array[i*N+(j-(i+1))-(i)*(i+1)/2] = array[i]*10 + array[j];

which avoids the explicit index variable k by computing the target index directly as i*N+(j-(i+1))-(i)*(i+1)/2. This observation is useful because, if you interpret the indices i and j as thread indices in the ported code, you obtain a mapping between the 2D thread indices and the linear index needed to access the target array in the __global__ function you have to define.
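As a quick sanity check (not part of the original code, and assuming the same N = 7), the following host-only snippet prints the linear index produced by the closed-form formula for every pair with j > i; it should count up from 0 to (N-1)*N/2 - 1 without gaps or repetitions:

#include <stdio.h>
#define N 7

int main(){

    // Compare the closed-form index against a running counter k
    int k = 0;
    for (int i = 0; i < N; i++)
        for (int j = i+1; j < N; j++)
        {
            int idx = i*N+(j-(i+1))-(i)*(i+1)/2;
            printf("(i = %d, j = %d) -> idx = %d, k = %d\n", i, j, idx, k);
            k++;
        }

    return 0;
}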

Accordingly, the ported code is

#include <stdio.h>
#define N 7

__global__ void kernel(int* new_array_d, int* array_d) {

    int i = threadIdx.x;
    int j = threadIdx.y;

    // Only threads with j > i write a result; the closed-form expression maps
    // the 2D pair (i, j) to the corresponding position in the linear output array
    if (j > i) new_array_d[i*N+(j-(i+1))-(i)*(i+1)/2] = array_d[i]*10 + array_d[j];
}

int main(){

    int array[N] = { 1, 2, 3, 4, 5, 6, 7};
    int new_array[(N-1)*N/2] = { 0 };

    // Allocate device memory for the input and output arrays
    int* array_d;       cudaMalloc((void**)&array_d,N*sizeof(int));
    int* new_array_d;   cudaMalloc((void**)&new_array_d,(N-1)*N/2*sizeof(int));

    cudaMemcpy(array_d,array,N*sizeof(int),cudaMemcpyHostToDevice);

    // A single block of N x N threads; each thread handles one (i, j) pair
    dim3 grid(1,1);
    dim3 block(N,N);
    kernel<<<grid,block>>>(new_array_d,array_d);

    cudaMemcpy(new_array,new_array_d,(N-1)*N/2*sizeof(int),cudaMemcpyDeviceToHost);

    for (int i = 0; i < (N-1)*N/2; i++) printf("new_array[%d] = %d\n", i, new_array[i]);

    cudaFree(array_d);
    cudaFree(new_array_d);

    return 0;
}

Please add your own CUDA error checking in the sense of What is the canonical way to check for errors using the CUDA runtime API?. Also, you may wish to extend the above CUDA code to grids with more than one block, which becomes necessary as soon as N*N exceeds the maximum number of threads per block (1024 on most GPUs).
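A minimal sketch of what such an extension might look like is below (not part of the original answer): it assumes N = 100 so that more than one block is actually needed, uses a hypothetical gpuErrchk/gpuAssert wrapper as the error-checking helper, and guards the kernel against out-of-range thread indices:

#include <stdio.h>
#include <stdlib.h>
#define N 100

// Hypothetical error-checking helper; any equivalent wrapper around cudaError_t will do
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line){
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

__global__ void kernel(int* new_array_d, const int* array_d) {

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // Discard out-of-range threads and keep only the pairs with j > i
    if (i < N && j < N && j > i)
        new_array_d[i*N+(j-(i+1))-(i)*(i+1)/2] = array_d[i]*10 + array_d[j];
}

int main(){

    int array[N], new_array[(N-1)*N/2];
    for (int i = 0; i < N; i++) array[i] = i + 1;

    int* array_d;       gpuErrchk(cudaMalloc((void**)&array_d,N*sizeof(int)));
    int* new_array_d;   gpuErrchk(cudaMalloc((void**)&new_array_d,(N-1)*N/2*sizeof(int)));

    gpuErrchk(cudaMemcpy(array_d,array,N*sizeof(int),cudaMemcpyHostToDevice));

    // 16 x 16 threads per block; enough blocks in each direction to cover N x N threads
    dim3 block(16,16);
    dim3 grid((N + block.x - 1)/block.x, (N + block.y - 1)/block.y);
    kernel<<<grid,block>>>(new_array_d,array_d);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(new_array,new_array_d,(N-1)*N/2*sizeof(int),cudaMemcpyDeviceToHost));

    for (int i = 0; i < (N-1)*N/2; i++) printf("new_array[%d] = %d\n", i, new_array[i]);

    gpuErrchk(cudaFree(array_d));
    gpuErrchk(cudaFree(new_array_d));

    return 0;
}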

– Vitality