
I am making a CUDA program and am stuck at a problem. I have two functions:

  1. __global__ void cal_freq_pl(float *, char *, char *, int *, int *)
  2. __global__ void cal_sum_vfreq_pl(float *, float *, char *, char *, int *)

I call the first function like this: cal_freq_pl<<<M,512>>>( ... ); M is about 15, so I'm not worried about it, and 512 is the maximum number of threads per block on my GPU. This works fine and gives the expected output for all M*512 values.

But when I call the 2nd function in a similar way: cal_sum_vfreq_pl<<<M,512>>>( ... ); it does not work. After debugging the crap out of that function, I finally found out that it only runs with these dimensions: cal_sum_vfreq_pl<<<M,384>>>( ... );, which is 128 less than 512. It shows no error with 512, but an incorrect result.

I currently only have access to compute capability 1.0 hardware: an NVIDIA Quadro FX 4600 card in a 64-bit Windows machine.

I have no idea why such behavior should happen; I am positive that the 1st function runs for all 512 threads while the 2nd only runs for 384 (or fewer).

Can someone please suggest some possible solution?

Thanks in advance...

EDIT: Here is the kernel code:

__global__ void cal_sum_vfreq_pl(float *freq, float *v_freq_vectors, char *wstrings, char *vstrings, int *k){
    int index = threadIdx.x;
    int m = blockIdx.x;
    int block_dim = blockDim.x;
    int kv = *k; int vv = kv-1; int wv = kv-2;
    int woffset = index*wv;
    int no_vstrings = pow_pl(4, vv);
    float temppp=0;
    char wI[20], Iw[20]; int Iwi, wIi;
    for(int i=0;i<wv;i++) Iw[i+1] = wI[i] = wstrings[woffset + i];
    for(int l=0;l<4;l++){
        Iw[0] = get_nucleotide_pl(l);
        wI[vv-1] = get_nucleotide_pl(l);
        Iwi = binary_search_pl(vstrings, Iw, vv);
        wIi = binary_search_pl(vstrings, wI, vv);
        temppp = temppp + v_freq_vectors[m*no_vstrings + Iwi] + v_freq_vectors[m*no_vstrings + wIi];
    }
    freq[index + m*block_dim] = 0.5*temppp;
}
talonmies
vegeta
  • What does "it does not work" mean? Is there a kernel launch failure, or does the kernel start and crash inside? Maybe just a boundary check fails inside the kernel. – hubs Feb 05 '13 at 10:46
  • @hubs: Starting at 385 threads it gives wrong output; it's just weird. – vegeta Feb 05 '13 at 10:50
  • If it starts failing at 385 threads and the result is wrong, then there is pretty surely a bug in your kernel source code. But it's hard to say without seeing your kernel code. – hubs Feb 05 '13 at 10:55
  • @hubs: No, I mean if I put blockDim = 385 the output is wrong, and the entire output is wrong, not just from 385 onward but all M*512 values. I'll add the kernel code... – vegeta Feb 05 '13 at 11:01
  • As a general rule you should always check whether a thread is going to access memory within what you have allocated. Try running your program with `cuda-memcheck` – pQB Feb 05 '13 at 11:12
  • Thanks, although this was not the problem. – vegeta Feb 05 '13 at 12:17

1 Answer


It seems you are using a lot of registers in the second kernel. You cannot always reach the maximum threads per block because of hardware resource limits, such as the number of registers available per block.

CUDA provides a tool to help calculate the proper number of threads per block:

http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

You can also find this .xls file in your CUDA installation directory.

kangshiyin
  • Oh, I'll look into that. One quick question, does char[20] count as 1 or 20 registers? – vegeta Feb 05 '13 at 11:28
  • It seems `binary_search_pl( ... );` is the root of all problems. I somehow need to replace it with something so that the threads/block do not go down to 384. I'm still not sure where the register variables are being used, but it has got something to do with this `__device__` function. As the answer in the above comment suggests, long char arrays are automatically moved to local memory, and I tested this too: char[20] is not a problem, but `__device__` function calls are (binary_search_pl in this case); they use too many registers. – vegeta Feb 05 '13 at 12:21
  • @user1961040, there are compile options that can output the register usage for you. Alternatively, I often use the Visual Profiler to check register usage. – kangshiyin Feb 05 '13 at 12:38
  • I've just discovered something new. The number of threads is indeed reduced because of the kernel code's register usage (and `__device__` function calls). But the wrong output I'm getting is not due to a miscalculation in the kernel code. The code isn't even compiling, yet VS2008 shows no error, even with the --ptxas-options=-v option. So the code that runs is the last successfully compiled version: whatever change I make (with no. of threads > 384) is not compiled, so a previous version runs. I run into this problem quite often, so I thought I'd write it here for anybody facing a similar issue. – vegeta Feb 08 '13 at 12:11
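As a footnote to the compile options mentioned in the comments above: the `--ptxas-options=-v` flag makes ptxas report each kernel's resource usage at build time, which is the register count that feeds the occupancy calculation. A typical invocation might look like this (`kernel.cu` is a placeholder filename; the exact wording of the summary varies by toolkit version):

```shell
# Compile for the compute 1.0 target and ask ptxas to report
# per-kernel resource usage (registers, shared/local memory).
nvcc --ptxas-options=-v -arch=sm_10 -c kernel.cu
# ptxas prints a summary per kernel along the lines of:
#   ptxas info : Used 21 registers, 48+16 bytes smem
```

A crucial detail given the last comment: check that this compile step actually succeeds before trusting the binary you run, since a stale, previously compiled version may otherwise execute silently.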