
I am new to CUDA and I have a problem. I want to add synchronization to my threads, so I tried to use `__syncthreads()`. The problem is that Visual Studio 2010 says: identifier "__syncthreads()" is undefined... I am using CUDA 4.2, by the way. So I decided to use `cudaDeviceSynchronize()` instead and call it from the host. My code is something like the below (I am sending you only the important parts):

__global__ void sum( float avg[]){
  avg[0]+=1;
  avg[1]+=2;
}
int main(){
  float avg[2];
  float *devAvg;
  cudaError_t cudaStatus;
  size_t size=sizeof(unsigned char)*2;
  cudaStatus = cudaMalloc((void**)&devAvg, size2);
  if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc 2 failed!");
    return -1;
  }
  avg[0]=0;
  avg[1]=0;
  cudaStatus = cudaMemcpy(devAvg,avg, size, cudaMemcpyHostToDevice);
  if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    return -1;
  }
  dim3 nblocks(40,40);
  dim3 nthreads(20,20);
  sum<<<nblocks,nthreads,msBytes>>>(devAvg);
  cudaStatus = cudaDeviceSynchronize();
  if (cudaStatus != cudaSuccess) {
      fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
  }

  cudaStatus = cudaMemcpy(avg,devAvg,size,cudaMemcpyDeviceToHost);
  if (cudaStatus != cudaSuccess) {
      fprintf(stderr, "cudaMemcpy Device to Host failed!");
      return -1;}
  cout<<"avg[0]="avg[0]<<" avg[1]="<<avg[1]<<endl;
  cudaFree devAvg;
  return 0;
  }

I thought that the results should be avg[0]=640,000 and avg[1]=1,280,000,

but not only are my results different (this could be an overflow problem), they are also not stable. For example, for three different executions the results are:

avg[0]=3041 avg[1]=6604

avg[0]=3015 avg[1]=6578

avg[0]=3047 avg[1]=6600

So what am I doing wrong here? Is it a synchronization problem? Or is it a race condition? And why can I not use `__syncthreads()`?

Additionally, the `__syncthreads()` problem occurs with any code that I write, even the simplest one:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <Windows.h>


// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx<N) a[idx] = a[idx] * a[idx];
  __syncthreads();
}

// main routine that executes on the host
int main(void)
{
  float *a_h, *a_d;  // Pointer to host & device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Do calculation on device:
  int block_size = 4;
  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
  square_array <<< n_blocks, block_size >>> (a_d, N);
  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print results
  for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
  // Cleanup
  free(a_h); cudaFree(a_d);
      return 0;
}

It is saying this: Error: identifier "__syncthreads()" is undefined

The funny part is that the same thing happens even with the sample codes that come with the CUDA 4.2 SDK... Maybe something more general is wrong, because there are more functions in the SDK samples that are considered undefined.

user1870996
    shouldn't `size=sizeof(float)*2`? – talonmies Dec 12 '12 at 16:58
  • I'm not sure what to say about the `__syncthreads()` issue. The code you added compiles and runs fine for me. I think it indicates that there is something wrong with your environment. You might want to follow the steps carefully in the [windows getting started guide](http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-microsoft-windows/index.html) (after first uninstalling the cuda version that you have.) – Robert Crovella Dec 12 '12 at 19:49
  • It looks like the __syncthreads() problem may be due to an interaction between visual studio and the specific include files you have. When compiling .cu files with nvcc (even in VS) it's normally not necessary to specifically include `cuda_runtime.h` and `device_launch_parameters.h` (you'll notice my answer does not contain those) Try removing those include statements from your source files and then see if you can compile with `__syncthreads()` – Robert Crovella Dec 25 '12 at 14:14

1 Answer


All of your blocks of threads are writing to the same two locations. The only way to make this work properly is to use atomic operations. Otherwise the result of threads reading a location, adding to it, and writing the result back to that location "simultaneously" is undefined.

If you rewrite your kernel as follows:

__global__ void sum( float avg[]){
   atomicAdd(&(avg[0]),1);
   atomicAdd(&(avg[1]),2);
}

It should resolve the issue you are seeing.

To answer the question about __syncthreads(), I would need to see the exact code that caused the compiler error. If you post that, I'll update my answer. There shouldn't be a problem with inserting a __syncthreads() call in this kernel, although it won't fix the problem you are seeing.
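For reference, `__syncthreads()` is a barrier for the threads of a single block, and its typical use is staging data through shared memory. Here is a minimal sketch (the kernel name `block_reverse` and the sizing assumption are mine, not from your code), assuming n is an exact multiple of blockDim.x:

```cuda
// Hypothetical example: reverse the elements within each block via shared
// memory. The barrier guarantees that every thread's write to s[] has
// completed before any thread reads a value written by a different thread.
// Assumes n is an exact multiple of blockDim.x.
__global__ void block_reverse(float *d, int n)
{
    extern __shared__ float s[];   // sized by the third <<<>>> launch argument
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    if (i < n) s[t] = d[i];
    __syncthreads();               // must be reached by all threads of the block
    if (i < n) d[i] = s[blockDim.x - 1 - t];
}
```

A launch would look something like `block_reverse<<<n/threads, threads, threads*sizeof(float)>>>(d, n);`. Note that `__syncthreads()` is placed outside the `if`, since the barrier must be reached by every thread of the block.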

You may wish to review the atomic functions section of the CUDA C Programming Guide.

Note that using atomics generally will cause your code to run slower, so they should be used carefully. However for this learning exercise it should sort out the issue for you.
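As a sketch of that trade-off (this variant is my own illustration, not something your program needs): contention on the two global counters can be reduced by accumulating per block in shared memory and then issuing a single global atomicAdd per block:

```cuda
// Sketch: each block accumulates its threads' contributions into shared
// memory, then one thread per block performs a single global atomicAdd.
// This cuts global atomic traffic from one update per thread to one per
// block. Shared-memory float atomicAdd also requires compute capability 2.0.
__global__ void sum_per_block(float avg[])
{
    __shared__ float part[2];
    if (threadIdx.x == 0 && threadIdx.y == 0) { part[0] = 0.f; part[1] = 0.f; }
    __syncthreads();                       // part[] initialized for all threads
    atomicAdd(&part[0], 1.f);              // contend only within the block
    atomicAdd(&part[1], 2.f);
    __syncthreads();                       // all contributions accumulated
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        atomicAdd(&avg[0], part[0]);       // one global update per block
        atomicAdd(&avg[1], part[1]);
    }
}
```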

Also note that the code you posted doesn't compile cleanly: there are a number of missing definitions and a variety of other issues. But since you are posting results, I assume you have some version of this working, even though you haven't posted it. Therefore I haven't identified every issue in the code that you have posted.

Here is code that is similar to yours with all of the various coding issues fixed, and it seems to work for me:

#include <stdio.h>
#include <iostream>

#define msBytes 0

__global__ void sum( float avg[]){
  atomicAdd(&(avg[0]),1);
  atomicAdd(&(avg[1]),2);
}
int main(){
  float avg[2];
  float *devAvg;
  cudaError_t cudaStatus;
  size_t size=sizeof(float)*2;
  cudaStatus = cudaMalloc((void**)&devAvg, size);
  if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc 2 failed!");
    return -1;
  }
  avg[0]=0;
  avg[1]=0;
  cudaStatus = cudaMemcpy(devAvg,avg, size, cudaMemcpyHostToDevice);
  if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    return -1;
  }
  dim3 nblocks(40,40);
  dim3 nthreads(20,20);
  sum<<<nblocks,nthreads,msBytes>>>(devAvg);
  cudaStatus = cudaDeviceSynchronize();
  if (cudaStatus != cudaSuccess) {
      fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
  }

  cudaStatus = cudaMemcpy(avg,devAvg,size,cudaMemcpyDeviceToHost);
  if (cudaStatus != cudaSuccess) {
      fprintf(stderr, "cudaMemcpy Device to Host failed!");
      return -1;}
  std::cout<<"avg[0]="<<avg[0]<<" avg[1]="<<avg[1]<<std::endl;
  cudaFree(devAvg);
  return 0;
  }

I get the following output when I run it:

avg[0]=640000 avg[1]=1.28e+06

Also note that for atomicAdd to be usable on float, you need a compute capability 2.0 or better device (and must pass the corresponding compiler switch, e.g. -arch=sm_20, to compile for that kind of device). If you have an earlier (compute capability 1.x) device, you can create a similar program defining avg[] as int instead of float. Or, if you prefer, you can create your own atomicAdd __device__ function that is usable on a cc 1.x device, as suggested here in the section beginning with "Note however that any atomic operation can be implemented based on atomicCAS() (Compare And Swap).".
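For completeness, the atomicCAS()-based pattern that section describes looks roughly like this (the function name myAtomicAdd is mine; the loop follows the programming guide's example, and 32-bit atomicCAS on global memory itself requires compute capability 1.1 or higher):

```cuda
// Build a float atomicAdd out of a 32-bit compare-and-swap. Each iteration
// reads the current bits, computes the new sum, and attempts to swap it in;
// if another thread changed the word in the meantime, atomicCAS returns a
// different value than we assumed and the loop retries.
__device__ float myAtomicAdd(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);   // retry if another thread intervened
    return __int_as_float(old); // mimic atomicAdd: return the prior value
}
```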

Robert Crovella
  • Thanks a lot for your answer! I will try the atomicAdd... You are right that I have a version of the code that compiles... I edited my question with more information about the `__syncthreads()` problem. – user1870996 Dec 12 '12 at 19:27