0
using namespace std;
#include <stdio.h>
#include <stdlib.h>
#include <iostream>

const int threadsPerBlock = 256;
const int blocksPerGrid = 1024;
const int N = 64;

 __global__ void reverse(int *data, int count){

        __shared__ int cache[threadsPerBlock];
        int tid = threadIdx.x + blockIdx.x * blockDim.x;

        int cacheIndex = threadIdx.x;
        int tr = count-cacheIndex-1;
        if(tid< count/2)
        cache[cacheIndex] = data[cacheIndex];

        __syncthreads();
        data[cacheIndex] = cache[tr];
    }

int main(void){

    int a[N];
    int *devA;

    generate(a,N);

    cudaMalloc((void**)&devA, N * sizeof(int));


    cudaMemcpy(devA, a, N * sizeof(int), cudaMemcpyHostToDevice);

    reverse<<<blocksPerGrid,threadsPerBlock>>>(devA,N);

    cudaMemcpy(a,devA, N * sizeof(int), cudaMemcpyDeviceToHost);


    cout << a[63];

    cudaFree(devA);

}

Above code does not reverse my reverse. What is wrong with this program? What am i going wrong? I think everything is okay. What do i need to edit to work it properly? What is wrong?

alrikai
  • 4,123
  • 3
  • 24
  • 23
ehah
  • 675
  • 1
  • 7
  • 11

1 Answers1

2

You're launching way too many threads. For the algorithm you have, the number of threads needed is N. But you're launching 1024*256 threads.

Alternatively, and probably good coding practice, would be to wrap the code in your kernel with a thread check, like:

    int idx = threadIdx.x + blockDim.x*blockIdx.x;

    if (idx<count){
      // put your kernel code here
    }

Also, your kernel is written in such a way that it will really only work for data sizes that fit within a single threadblock.

It's probably better if you just review the solution proposed by @alrikai here. That solution doesn't require any synchronization or use of shared memory, so it's quite a bit simpler.

EDIT responding to a question below.

I made a mistake because I was thinking about alrikai's solution. I edited my code above. Try that.

Community
  • 1
  • 1
Robert Crovella
  • 143,785
  • 11
  • 213
  • 257
  • i edited my kernel function. Please look at it. It is not working. – ehah May 17 '13 at 19:32
  • Why not use the kernel proposed by alrikai that I linked to in my answer? The problem with the edits to the kernel you just made is that you don't understand what the if statement is supposed to do. You didn't use the curly-braces that I hinted at. *All* of your kernel code should be conditional on the if-statement I suggested. – Robert Crovella May 17 '13 at 19:47
  • i want to do it with shared memory .) – ehah May 17 '13 at 19:49
  • 1
    Ok I modified my answer, it should work now. *HOWEVER* as I indicated, the shared memory approach will only work for N that will fit within a single threadblock. So if your N goes beyond 1024 (or 256, if you don't increase your threads per block), the shared memory approach will break, because there is no synchronization between threadblocks and no guarantee of the order in which threadblocks execute. Shared memory really gives *no* benefit to this solution. – Robert Crovella May 17 '13 at 19:56
  • I edited my original answer. Please look above to my original answer, and you will see that it has changed from what I originally posted. – Robert Crovella May 17 '13 at 20:04