I'm implementing k-means on the GPU, and for now I have the following code:
__device__ unsigned int cuda_delta = 0;

__global__ void kmeans_kernel(const sequence_t *data,
                              const sequence_t *centroids,
                              int *membership,
                              unsigned int n,
                              unsigned int numClusters)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        int min_distance = INT_MAX;
        int nearest = -1;
        // find the closest centroid to this data point
        for (int i = 0; i < numClusters; i++) {
            sequence_t centroid = centroids[i];
            int dist = distance(centroid, data[index]);
            if (dist < min_distance) {
                nearest = i;
                min_distance = dist;
            }
        }
        // count how many points changed cluster in this iteration
        if (membership[index] != nearest) {
            membership[index] = nearest;
            atomicInc(&cuda_delta, n);
        }
    }
}
As you can see, there is no data dependency in the algorithm itself; the only shared state is the counter cuda_delta, which lives in global memory. According to the documentation:

An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory

That is exactly what I need.
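To make sure I understand the semantics, here is a minimal standalone sketch (not my real code; the kernel name, the flag array and the sizes are made up purely for illustration) of how I expect atomicInc to behave: it returns the old value and stores ((old >= bound) ? 0 : old + 1), so with the bound set to n and at most n increments per iteration the counter should never wrap and the final value should be deterministic.

// Minimal sketch, just to illustrate the atomicInc semantics I rely on.
// The names (count_changed, changed, counter) are made up for this example.
#include <cstdio>

__device__ unsigned int counter = 0;

__global__ void count_changed(const int *changed, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && changed[i])
        atomicInc(&counter, n);   // serialized read-modify-write on global memory
}

int main()
{
    const unsigned int n = 8;
    int h_changed[n] = {1, 0, 1, 1, 0, 0, 1, 0};   // 4 flags set
    int *d_changed = NULL;
    cudaMalloc(&d_changed, n * sizeof(int));
    cudaMemcpy(d_changed, h_changed, n * sizeof(int), cudaMemcpyHostToDevice);

    unsigned int zero = 0, result = 0;
    cudaMemcpyToSymbol(counter, &zero, sizeof(unsigned int));
    count_changed<<<1, n>>>(d_changed, n);
    cudaMemcpyFromSymbol(&result, counter, sizeof(unsigned int));
    printf("counter = %u\n", result);              // prints 4 on every run
    cudaFree(d_changed);
    return 0;
}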
Edit: here is all of my host code:
unsigned int delta = 0; // number of objects that changed membership in the current iteration
label = (int*)calloc(data_size,sizeof(int));
centroids = (sequence_t*)calloc(clusters,sizeof(sequence_t));
// cuda variables
sequence_t * cuda_data = NULL;
sequence_t * cuda_centroids = NULL;
int *cuda_membership = NULL;
unsigned int *cuda_tmp_centroidCount = NULL;
const unsigned int threadsPerBlock = 1024;
const unsigned int numBlocks = (data_size + threadsPerBlock - 1) / threadsPerBlock;
const unsigned int numBlocks2 = (clusters + threadsPerBlock - 1) / threadsPerBlock;
for (unsigned int i = 0; i < clusters; i++) {
    int h = i * data_size / clusters;
    centroids[i] = make_ulong3(data[h].x, data[h].y, data[h].z);
}
memset (label,-1,data_size * sizeof(int));
checkCuda(cudaMalloc(&cuda_data, data_size * sizeof(sequence_t)));
checkCuda(cudaMalloc(&cuda_centroids, clusters * sizeof(sequence_t)));
checkCuda(cudaMalloc(&cuda_membership, data_size * sizeof(int)));
checkCuda(cudaMalloc(&cuda_tmp_centroidCount, clusters * BIT_SIZE_OF(sequence_t) *sizeof(unsigned int)));
checkCuda(cudaMemcpy(cuda_data,data, data_size *sizeof(sequence_t) , cudaMemcpyHostToDevice));
checkCuda(cudaMemcpy(cuda_centroids, centroids, clusters *sizeof(sequence_t) , cudaMemcpyHostToDevice));
checkCuda(cudaMemcpy(cuda_membership, label, clusters *sizeof(int) , cudaMemcpyHostToDevice));
int pc = 0;
do {
    cudaMemset(cuda_tmp_centroidCount, 0, clusters * BIT_SIZE_OF(sequence_t) * sizeof(unsigned int));
    delta = 0;
    checkCuda(cudaMemcpyToSymbol(cuda_delta, &delta, sizeof(unsigned int), 0, cudaMemcpyHostToDevice));
    kmeans_kernel <<< numBlocks, threadsPerBlock >>> (cuda_data,
                                                      cuda_centroids,
                                                      cuda_membership,
                                                      data_size,
                                                      clusters);
    cudaDeviceSynchronize();
    checkCuda(cudaMemcpyFromSymbol(&delta, cuda_delta, sizeof(unsigned int)));
    printf("%d - delta = %d\n", pc, delta);
    checkCuda(cudaGetLastError());
    pc++;
} while (delta > 0);
// copy output
checkCuda(cudaMemcpy(label,cuda_membership, clusters *sizeof(int) , cudaMemcpyDeviceToHost));
checkCuda(cudaMemcpy(centroids,cuda_centroids, clusters *sizeof(sequence_t) , cudaMemcpyDeviceToHost));
// free cuda memory
checkCuda(cudaFree(cuda_data));
checkCuda(cudaFree(cuda_centroids));
checkCuda(cudaFree(cuda_membership));
checkCuda(cudaFree(cuda_tmp_centroidCount));
checkCuda(cudaDeviceReset());
The delta value printed in the first iteration changes when I run the code multiple times, and it shouldn't. Most of the time the values printed are:
0 - delta = 18630
0 - delta = 859
The expected value is 18634. Am I missing something here?
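For what it's worth, here is a small check I could drop in just before the first kernel launch (hypothetical, not part of the repository) to confirm whether the device-side membership array really starts out as all -1; it only uses the variables already declared above:

// Hypothetical sanity check, not in the repository: read cuda_membership
// back before the first launch and count entries that are not -1 on the device.
int *host_check = (int*)malloc(data_size * sizeof(int));
checkCuda(cudaMemcpy(host_check, cuda_membership, data_size * sizeof(int),
                     cudaMemcpyDeviceToHost));
unsigned int not_initialized = 0;
for (unsigned int i = 0; i < data_size; i++)
    if (host_check[i] != -1)
        not_initialized++;
printf("entries not set to -1 on the device: %u\n", not_initialized);
free(host_check);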
Edit: The full code is available on GitHub. To run the example, just compile using make and run the program with the arguments below. Do it multiple times and you will see that the delta value for the first iteration is not always the expected one.
./cuda-means mus_musmusculus.dat 859
Thanks in advance!