CUDA kernel strange behavior, generating random values

Question

I'm trying to simulate a spring-mass system with CUDA. Below is the kernel that updates the position of the particles:

__global__ void timestep(float3 *pos, float3 *pos_antiga, float3 *aceleracao, int numParticulas) {

    int index = threadIdx.x + blockIdx.x * blockDim.x;

    if(index > (numParticulas - 1))
        return;
t
    float3 temp = pos[index];

    pos[index].x = pos[index].x + (pos[index].x - pos_antiga[index].x) * (1.0f - DAMPING) + aceleracao[index].x * TIMESTEP;
    pos[index].y = pos[index].y + (pos[index].y - pos_antiga[index].y) * (1.0f - DAMPING) + aceleracao[index].y * TIMESTEP;
    pos[index].z = pos[index].z + (pos[index].z - pos_antiga[index].z) * (1.0f - DAMPING) + aceleracao[index].z * TIMESTEP;

    pos_antiga[index] = temp;

}

The pos represents the 3D vector of the actual position, pos_antiga is the position in the previous time step, the DAMPING is 0.01 and TIMESTEP is 0.25. I'm using Verlet integration. In a system without any force, aceleracao is zero, so the pos and pos_antigo are the same before and after the kernel calls.

However, after the first iteration, CUDA returns crazy values for some axes, like 1.QNAN and -1.6241e+016. I think it's something related to block and thread sizes. The kernel call is below:

timestep<<<16, 16>>>(pos_d, pos_antiga_d, aceleracao_d, numParticulas);

So, what am I missing?

EDIT: Below is the caller code:

void timestepGPU(vector<Particula> *p) {

// vector<Particula> has all the particles of the system.

      // CPU
    float *pos;
    float *pos_antiga;
    float *aceleracao;

    // GPU
    float *pos_d;
    float *pos_antiga_d;
    float *aceleracao_d;

    // Number of particles
    int numParticulas = p->size();

    // Init
    pos = new float[numParticulas * 3];
    pos_antiga = new float[numParticulas * 3];
    aceleracao = new float[numParticulas * 3];

    // Transfering the values from the class to a plain vector
    vector<Particula>::iterator p_tmp;
    int i = 0;
    for(p_tmp = p->begin(); p_tmp != p->end(); p_tmp++)
    {
        pos[i] = (*p_tmp).getPos().f[0];
        pos[i + 1] = (*p_tmp).getPos().f[1];
        pos[i + 2] = (*p_tmp).getPos().f[2];

        pos_antiga[i] = (*p_tmp).getPosAntiga().f[0];
        pos_antiga[i + 1] = (*p_tmp).getPosAntiga().f[1];
        pos_antiga[i + 2] = (*p_tmp).getPosAntiga().f[2];

        aceleracao[i] = (*p_tmp).getAceleracao().f[0];
        aceleracao[i + 1] = (*p_tmp).getAceleracao().f[1];
        aceleracao[i + 2] = (*p_tmp).getAceleracao().f[2];

        i += 3;
    }

    // Here, I print the particle data BEFORE moving it to GPU
      cout << "PRINT PARTICLE DATA" << endl;
    for(i = 0; i < numParticulas * 3; i += 3) {
        cout << i/3 << " - Pos: " << pos[i] << " " << pos[i + 1] << " " << pos[i + 2] << " | Pos Ant: " << pos_antiga[i] << " " << pos_antiga[i + 1] << " " << pos_antiga[i + 2] << " | Acel: " << aceleracao[i] << " " << aceleracao[i + 1] << " " << aceleracao[i + 2] << endl;
    }
    cout << "END" << endl;

    // GPU
    ErroCUDA(cudaMalloc((void**) &pos_d, numParticulas * 3 * sizeof(float)));
    ErroCUDA(cudaMalloc((void**) &pos_antiga_d, numParticulas * 3 * sizeof(float)));
    ErroCUDA(cudaMalloc((void**) &aceleracao_d, numParticulas * 3 * sizeof(float)));

    // Moving data
    ErroCUDA(cudaMemcpy(pos_d, pos, numParticulas * 3 * sizeof(float), cudaMemcpyHostToDevice));
    ErroCUDA(cudaMemcpy(pos_antiga_d, pos_antiga, numParticulas * 3 * sizeof(float), cudaMemcpyHostToDevice));
    ErroCUDA(cudaMemcpy(aceleracao_d, aceleracao, numParticulas * sizeof(float), cudaMemcpyHostToDevice));

    // Setting number of blocks and threads per block
    unsigned int numThreads, numBlocos;
    calcularGrid(numParticulas, 64, numBlocos, numThreads);
    //cout << numBlocos << "----------" << numThreads << endl;

    // Kernel
    timestep<<<numBlocos, numThreads>>>((float3 *) pos_d, (float3 *) pos_antiga_d, (float3 *) aceleracao_d, numParticulas);
    ErroCUDA(cudaPeekAtLastError());
    cudaDeviceSynchronize();

    // Moving data back to the CPU
    ErroCUDA(cudaMemcpy(pos, pos_d, numParticulas * 3 * sizeof(float), cudaMemcpyDeviceToHost));
    ErroCUDA(cudaMemcpy(pos_antiga, pos_antiga_d, numParticulas * 3 * sizeof(float), cudaMemcpyDeviceToHost));

      // Printing the particles' data AFTER Kernel call. At my GT 4xx, close to the 48th particle, it starts to show crazy values
    cout << "PARTICLE DATA" << endl;
    for(i = 0; i < numParticulas * 3; i += 3) {
        cout << i/3 << " - Pos: " << pos[i] << " " << pos[i + 1] << " " << pos[i + 2] << " | Pos Ant: " << pos_antiga[i] << " " << pos_antiga[i + 1] << " " << pos_antiga[i + 2] << " | Acel: " << aceleracao[i] << " " << aceleracao[i + 1] << " " << aceleracao[i + 2] << endl;
    }
    cout << "END" << endl;

    system("pause");

    i = 0;
    for(p_tmp = p->begin(); p_tmp != p->end(); p_tmp++)
    {
        if((*p_tmp).getMovel())
        {
            (*p_tmp).setPos(Vetor(pos[i], pos[i + 1], pos[i + 2]));
            (*p_tmp).setPosAntiga(Vetor(pos_antiga[i], pos_antiga[i + 1], pos_antiga[i + 2]));
(*p_tmp).setAceleracao(Vetor(0, 0, 0));
        }

        i += 3;
    }

    ErroCUDA(cudaFree(pos_d));
    ErroCUDA(cudaFree(pos_antiga_d));
    ErroCUDA(cudaFree(aceleracao_d));

    free(pos);
    free(pos_antiga);
    free(aceleracao);
}

In my example, attribute p has 100 items (10 x 10 particles). It's a mesh starting at (0, 0, 0) and goes to (20, 20, 20) in 3D space.

Thanks again everyone for the help!

SO expects: "Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See SSCCE.org for guidance. " You haven't provided an SSCCE.org code. Please provide a *complete* compilable reproducer of the problem. Please also make sure you are doing [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) on all kernel launches and CUDA API calls. — Robert Crovella, Sep 27 '13 at 03:57
I cannot spot anything obviously out of order. Does your code check the status of every CUDA API call and every kernel launch? Have you tried running it under cuda-memcheck? — njuffa, Sep 27 '13 at 03:58
You can try printing the values of pos and pos_antiga before and after your calculation for particular thread or index and check manually if the values are correct. Atleast you should be able to narrow down your problem with this. — Sagar Masuti, Sep 27 '13 at 04:41
@RobertCrovella There is the caller code. The problem arises when, at the kernel, I add the "+ aceleracao[index] * TIMESTEP" to the formula. I'm running it from Visual Studio, Debug mode. As you can see, all CUDA checks are made. — Herleson Pontes, Sep 27 '13 at 11:55
@njuffa I ran this code in NSight to VS 2012, and no problem arise. — Herleson Pontes, Sep 27 '13 at 11:57
@SagarMasuti I already did that. As you can se the code, the values are printed before and after the kernel call. — Herleson Pontes, Sep 27 '13 at 11:58
@HerlesonPontes: I meant print inside the kernel for a particular index like first, middle and last value. — Sagar Masuti, Sep 27 '13 at 12:16

score 2 · Accepted Answer · answered Sep 27 '13 at 12:14

2

I think your problem is in this line..

 ErroCUDA(cudaMemcpy(aceleracao_d, aceleracao, numParticulas * sizeof(float), cudaMemcpyHostToDevice));

should be ..

  ErroCUDA(cudaMemcpy(aceleracao_d, aceleracao, numParticulas * 3 *sizeof(float), cudaMemcpyHostToDevice));

answered Sep 27 '13 at 12:14

Sagar Masuti

1,271
2
11
30

Indeed, the number 3 was missing. Thank you for your answer! However, the acceleration of the system is zero. Does CUDA has "dirty" values for spaces that there is no value set, just like C? I'll check this modification soon, because now I'm not in front of a CUDA capable machine. – Herleson Pontes Sep 27 '13 at 12:20
Yes!! The cudaMalloc documentation says: `cudaError_t cudaMalloc ( void ** devPtr, size_t size )` "Allocates size bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory. The allocated memory is suitably aligned for any kind of variable. **The memory is not cleared**. cudaMalloc() returns cudaErrorMemoryAllocation in case of failure." – Sagar Masuti Sep 27 '13 at 12:28
Thanks for the reference. I'm beginning using CUDA for some researches, so my knowledge is not solid. I'll try this correction and see if this corrects everything. Thanks again. – Herleson Pontes Sep 27 '13 at 12:28
Your welcome. I didnt test it either.. Hopefully it solves your problem. – Sagar Masuti Sep 27 '13 at 12:35

CUDA kernel strange behavior, generating random values

1 Answers1