
I am new to CUDA programming, and I have a bit of a problem. I was trying to write a program that needs inter-thread communication; I have tried every possible way I could find, and it is still not working. What do you think I am missing?

The code snippet below is my entire program. It starts 2 threads in the very same block. They get an input array, an output array, and another global variable to communicate through. The value 0 means that the variable is empty, and therefore writable. Basically the first thread reads an element from the input and passes the value to the second one, which writes it into the output array. Later it is supposed to become a pipeline, with more threads between A and B.

#include <cuda.h>
#include <cuda_runtime.h>

#include <stdio.h>

#define N 1

__global__ void link(int *in, int *out, int *pipe){ 
    int id = threadIdx.y*blockDim.x + threadIdx.x;  //compute index     

    if(id == 0){        //writer thread

        for(int index = 0;index<N;){                
            if(pipe[0]==0){             
                atomicExch(pipe, in[index++]);              
            }           
        }
    }
    else if(id == 1){   // reader thread    

        for(int index=0;index<N;)   {
            if(pipe[0]!=0){
                out[index++] = atomicExch(pipe, 0); //read and make it empty    
            }           
        }           
    }
}

int main(){
    int input[] = {8,7};
    int *dev_input;
    int *dev_output;
    int *dev_pipe;
    int *output = (int*) malloc (N*sizeof(int));

    cudaMalloc((void**) &dev_input, N*sizeof(int));
    cudaMalloc((void**) &dev_output, N*sizeof(int));
    cudaMalloc((void**) &dev_pipe, 1*sizeof(int));
    cudaMemset(dev_pipe, 0, 1);
    cudaMemcpy(dev_input, &input[0], N*sizeof(int), cudaMemcpyHostToDevice);

    link<<<1, 2>>>(dev_input, dev_output, dev_pipe);

    cudaMemcpy(output, dev_output, N*sizeof(int), cudaMemcpyDeviceToHost);

    printf("[%d", output[0]);
    for(int i = 1;i<N;i++)
        printf(", %d", output[i]);
    printf("]\n");
    int d = 0;
    scanf("\n", &d);

}

If the writer sees that the pipe is 0 (empty), it puts the first element in it, but the reader cannot see any change, and the program goes into a deadlock. I tried adding __threadfence() and __syncthreads(), but it did not help. I also tried volatile memory, but it didn't work either. Please help me if you can, because I have no idea what is wrong with it.
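For example, one volatile variant I tried was roughly this (a simplified sketch, not my exact code; it marks the polled pointer volatile so pipe[0] is re-read from global memory on every iteration):

__global__ void link(int *in, int *out, int *pipe){
    volatile int *vpipe = pipe;  // volatile: force a fresh read of pipe[0] on every poll
    int id = threadIdx.y*blockDim.x + threadIdx.x;

    if(id == 0){         // writer thread
        for(int index = 0; index < N; ){
            if(vpipe[0] == 0)                         // slot is empty
                atomicExch((int*)pipe, in[index++]);  // publish next element
        }
    }
    else if(id == 1){    // reader thread
        for(int index = 0; index < N; ){
            if(vpipe[0] != 0)                         // slot is full
                out[index++] = atomicExch((int*)pipe, 0);  // consume and mark empty
        }
    }
}

Even this variant deadlocked the same way.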

  • Try [this](http://pastebin.com/mSAZm86S). Also, it's always good to do some error checking, as mentioned [here](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). – Sagar Masuti Nov 21 '13 at 11:37
  • Thank you, I will try this, but I'm afraid it won't work either. By the time the first thread reaches __syncthreads(), wouldn't it have to have passed all the data to the second one? But the second won't take any of it before __syncthreads() gets called, so it could be a deadlock again. Or am I thinking wrong again? – user3017074 Nov 21 '13 at 12:24
  • `__syncthreads()` is like a barrier: all the threads in a block reach it and then continue further execution. Note the answer posted below, as it points out some mistakes (incrementing inside the condition, i.e. `in[index++]`, is not a good idea). If you are sure you want to continue using CUDA, then you need a proper understanding of the thread, block and grid concepts. Also, one more error to point out: `cudaMemset(dev_pipe, 0, 1);` should be `cudaMemset(dev_pipe, 0, 1*sizeof(int));` (see the sketch after this comment). If you fix this and use the code pasted above by me, it should work. And most important, don't forget to do error checking. – Sagar Masuti Nov 21 '13 at 12:30
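A minimal sketch of that memset fix combined with basic error checking, in the spirit of the linked Q&A (the CUDA_CHECK macro here is an illustrative helper, not taken from that link):

#include <stdlib.h>  /* for exit() */

#define CUDA_CHECK(call) do {                                   \
    cudaError_t err = (call);                                   \
    if (err != cudaSuccess) {                                   \
        fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                cudaGetErrorString(err), __FILE__, __LINE__);   \
        exit(EXIT_FAILURE);                                     \
    }                                                           \
} while (0)

/* ... */
CUDA_CHECK(cudaMalloc((void**) &dev_pipe, 1*sizeof(int)));
CUDA_CHECK(cudaMemset(dev_pipe, 0, 1*sizeof(int)));  // clear the whole int, not just 1 byte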

1 Answer


Beware, CUDA threads are very different from POSIX threads. They work according to the Single Instruction Multiple Threads (SIMT) paradigm: at each clock tick, every thread in a single warp runs the same (low-level) instruction.

In your code, the writer thread will run while the reader thread performs NOPs, and then the reader thread will run while the writer performs NOPs, but they will never run simultaneously, so you would not benefit from the massively parallel structure of GPUs.

Anyway, to answer your question, your for loops

for(int index=0;index<N;)

do not increment index in the loop header; index only advances inside the conditional body, so if the condition is never observed to become true, they spin forever. Replace with

for(int index=0;index<N;index++)

And shared memory, which is common to all threads within the same block, would be much faster than global memory for inter-thread communication.
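A minimal sketch of that idea, assuming the same N and the two-thread block from your code (link_shared is just an illustrative name): a __shared__ slot replaces the global pipe, and __syncthreads() serves as the hand-off, so no polling is needed.

__global__ void link_shared(int *in, int *out){
    __shared__ int slot;            // one-element buffer visible to the whole block
    int id = threadIdx.y*blockDim.x + threadIdx.x;

    for(int index = 0; index < N; index++){
        if(id == 0)
            slot = in[index];       // producer fills the slot
        __syncthreads();            // barrier: the write becomes visible to the block
        if(id == 1)
            out[index] = slot;      // consumer drains the slot
        __syncthreads();            // slot may be safely reused next iteration
    }
}

Both threads execute both barriers on every iteration, so the block never deadlocks.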

damienfrancois
  • Thank you for answering me! So you are saying that even if I solve the problem, it will be horribly inefficient? Then it's a bad idea to do this on a GPU, isn't it? – user3017074 Nov 21 '13 at 12:26
  • I'm afraid so. GPUs are designed to perform data-parallel tasks, that is, to have many, many threads performing the same operation on many independent data chunks. It could work on a Xeon Phi, though. But remember that those accelerators run at rather low clock frequencies, so you need to make optimal use of multithreading to gain performance. – damienfrancois Nov 21 '13 at 12:31
  • For inter-thread communication in CUDA, you generally should use shared memory. Atomics are best used for inter-block communication. – ArchaeaSoftware Nov 21 '13 at 21:53