I've encountered a problem with sin and cos in CUDA when compiling for compute capability 2.0. It doesn't appear when the same code is compiled for compute capability 1.x. I made a simple reproducer and tested it on a GeForce GTX 550 Ti and a GeForce GTX 480; both gave the same result. This is the code:

#include <cufft.h>
#include <stdio.h>
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#pragma once 
#ifdef __INTELLISENSE__
void __syncthreads();
void atomicAdd(int*, int);
#endif

__global__ void cuftFrequency(float* in_data, float* out_data, int N, int M, int fromM = 1)
{
    cuComplex s; 
    float t = 0; 
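    // Grid-stride loop over frequencies fromM..M: each thread advances by the total thread count.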
    for (int I = threadIdx.x + blockIdx.x * blockDim.x + fromM; I <= M; I += blockDim.x * gridDim.x)
    {
        s.x = 0;
        s.y = 0; 
        for (int J = 0; J < N; J++)
        {
            t = (6.0 * (J - N / 2)) / I;
            s.x += in_data[J] * cos(t);
            s.y += in_data[J] * sin(t);
        }
        // If there is no problem, every element of the array is set to 500;
        // otherwise, replace the constant with s.x * s.x + s.y * s.y.
        out_data[I - fromM] = 500; //s.x * s.x + s.y * s.y;
    }
}

extern "C" __declspec(dllexport)    void cuftColorQualifierExec(float* data, float *spm, int N, int M, int fromM)
{   
    float* in_data_dev;
    float* furie_dev;

    cudaDeviceProp prop;
    int N_Dev;
    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 2;
    prop.minor = 0;
    prop.maxThreadsPerBlock = M - fromM;
    cudaChooseDevice(&N_Dev, &prop);
    cudaSetDevice(N_Dev);
    cudaGetDeviceProperties(&prop, N_Dev);
    int N_thread = 576;
    int N_block = 2;

    cudaError_t err = cudaMalloc((void**)&in_data_dev, sizeof(float) * N);
    if (err != cudaSuccess)
        fprintf(stderr, "ERROR \"%s\" at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__);

    err = cudaMemcpy(in_data_dev, data, sizeof(float) * N, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
        fprintf(stderr, "ERROR \"%s\" at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__);

    err = cudaMalloc((void**)&furie_dev, sizeof(float) * (M - fromM + 1));
    if (err != cudaSuccess)
        fprintf(stderr, "ERROR \"%s\" at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__);

    cuftFrequency<<<N_block, N_thread>>>(in_data_dev, furie_dev, N, M, fromM); 

    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "ERROR \"%s\" at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__);

    err = cudaMemcpy(spm, furie_dev, sizeof(float) * (M - fromM + 1), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "ERROR \"%s\" at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__);

    err = cudaFree(furie_dev);
    if (err != cudaSuccess)
        fprintf(stderr, "ERROR \"%s\" at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__);

    err = cudaFree(in_data_dev);
    if (err != cudaSuccess)
        fprintf(stderr, "ERROR \"%s\" at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__);
}

int main()
{
    int M = 1024, fromM = 1, N = 4000;
    float* data = new float[N];
    float* spm = new float[M - fromM + 1];

    for (int I = 0; I < N; I++)
        data[I] = cos(6.0 * I);

    for (int I = 0; I < M - fromM + 1; I++)
        spm[I] = 0;

    cuftColorQualifierExec(data, spm, N, M, fromM);

    for (int I = 0; I < M - fromM + 1; I++)
        fprintf(stdout, "%d: %f\n", I, spm[I]);

    delete[] data;
    delete[] spm;
    return 0;
}

When the number of threads per block is more than 576, this code doesn't work, and it doesn't return any errors. I deliberately set the array elements to the constant 500 to convince myself that the program actually reaches this point. If the program returns the correct values for you, please increase the thread count and test again.

Why does this code work correctly when I compile it for compute capability 1.x, but not when I compile it for compute capability 2.0?

alvahtin
  • It's probably a registers-per-thread issue. You're not doing proper error checking after your kernel call, so you would not trap a kernel launch error caused by requesting too many registers. – Robert Crovella Sep 10 '13 at 15:04
  • @RobertCrovella: He is checking whether `cudaDeviceSynchronize()` returns an error after the kernel launch. That should catch launch errors, right? – Roger Dahl Sep 10 '13 at 15:56
  • No. It will not catch a certain type of launch error. Review proper CUDA kernel error checking [here](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api); a minimal sketch of the pattern follows these comments. – Robert Crovella Sep 10 '13 at 16:23
  • @RobertCrovella: Is the number of registers in architecture 1.1 higher than in 2.0? The hardware is the same; only the compilation changes. How can I foresee such a situation? Only by testing? Which call does the proper error checking for an error due to too many registers requested? – alvahtin Sep 10 '13 at 16:33
  • @RobertCrovella: I saw your message before I sent mine. I found the answer to the question "Which call does the proper error checking for too many registers requested?": yes, I get cudaErrorLaunchOutOfResources from cudaPeekAtLastError(). Thanks. But the first question is still open. – alvahtin Sep 10 '13 at 16:47
  • By "first question" I guess you mean the question about the difference between the 1.1 and 2.0 architectures? You can find a description of the hardware differences [here](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications) in table 12. Yes, there are hardware differences between 1.1 and 2.0 in terms of registers. But you should also consider that the compiler may use registers differently when compiling for 1.1 vs. 2.0. And if you search on SO you will find many answers that explain how to manage register usage. – Robert Crovella Sep 10 '13 at 17:00

1 Answer


There are hardware limits on how many threads can run per block. The limiting factor differs between GPU architectures and can be the number of available registers, the available shared memory, or the maximum number of resident threads per multiprocessor. You can determine what the limiting factor is for your GPU and kernel by using the CUDA Occupancy Calculator, which is included with CUDA.
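
Not part of the original answer, but a quick programmatic check (assuming the cuftFrequency kernel from the question): cudaFuncGetAttributes reports the per-thread register count, the per-thread local memory size, and the largest block a compiled kernel can be launched with. If maxThreadsPerBlock comes back smaller than the requested block size, the launch fails with cudaErrorLaunchOutOfResources.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void cuftFrequency(float*, float*, int, int, int); // the kernel from the question

void printKernelLimits()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, cuftFrequency);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return;
    }
    printf("registers per thread : %d\n", attr.numRegs);
    printf("local memory (bytes) : %lu\n", (unsigned long)attr.localSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
}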

Roger Dahl
  • The CUDA Occupancy Calculator provides information about the number of threads, blocks, registers, etc. But how can I know how many registers are required for any given operation? Other kernels, which do not contain sin and cos, run successfully. – alvahtin Sep 10 '13 at 17:44
  • Follow the instructions on the Help sheet in the CUDA Occupancy Calculator to find out how many registers are used by your kernels. – Roger Dahl Sep 10 '13 at 19:01
  • Specify "-Xptxas -v" to get nvcc to dump the number of registers and the amount of local memory used by the kernel. If replacing sin() with __sinf() and cos() with __cosf() fixes your problem, it may be that the driver is not able to allocate enough local memory for the threads. – ArchaeaSoftware Sep 10 '13 at 21:18
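
To illustrate the last comment (my addition, not part of the original thread): here is the kernel's inner loop rewritten with the single-precision intrinsics, which map directly to hardware instructions and therefore use far fewer registers, and no local memory, than the standard math functions.

// Inner loop of cuftFrequency with single-precision intrinsics.
// __cosf/__sinf trade some accuracy for much lower register pressure.
// 6.0f (not 6.0) keeps the arithmetic in single precision, which is
// what sm_1x silently did anyway by demoting doubles to floats.
for (int J = 0; J < N; J++)
{
    float t = (6.0f * (J - N / 2)) / I;
    s.x += in_data[J] * __cosf(t);
    s.y += in_data[J] * __sinf(t);
}

Compiling with "nvcc -arch=sm_20 -Xptxas -v" then prints the per-kernel register and local-memory usage that determines the largest block size that will launch.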