I am writing a CUDA program and am stuck on a problem. I have two kernels:
__global__ void cal_freq_pl(float *, char *, char *, int *, int *)
__global__ void cal_sum_vfreq_pl(float *, float *, char *, char *, int *)
I call the first function like this:
cal_freq_pl<<<M,512>>>( ... );
M is about 15, so the grid size is not a concern. 512 is the maximum number of threads per block on my GPU. This works fine and gives the expected output for all M*512 values.
But when I call the 2nd function in a similar way:
cal_sum_vfreq_pl<<<M,512>>>( ... );
it does not work. After a lot of debugging, I found that it does give correct results with these dimensions:
cal_sum_vfreq_pl<<<M,384>>>( ... );
That is 128 threads per block fewer than 512. With 512 threads no error is reported, but the results are incorrect.
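A kernel launch does not report failures by itself, so "no error" may only mean that nothing was checked. One way to find out whether the 512-thread launch is being rejected is to query the runtime's error state right after the launch. This is a minimal sketch under assumptions: dummy_kernel, launch_ok and run_with are hypothetical names standing in for the real kernel and host code, and cudaDeviceSynchronize() is assumed to be available (older toolkits provide cudaThreadSynchronize() instead).

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: each thread writes one element.
__global__ void dummy_kernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

// Returns true if the most recent launch was accepted and ran to completion.
// Prints the runtime's error string otherwise.
bool launch_ok(const char *label)
{
    // Launch-time failures (bad configuration, too many resources requested)
    // are reported by cudaGetLastError() immediately after the <<<...>>> call.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Failures that happen while the kernel runs only show up after a sync.
        // Assumption: a toolkit that has cudaDeviceSynchronize().
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Launches the stand-in kernel with the given dimensions and reports success.
bool run_with(int blocks, int threads)
{
    assert(blocks > 0 && threads > 0);
    float *d_out = NULL;
    size_t bytes = (size_t)blocks * threads * sizeof(float);
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        return false;
    }
    dummy_kernel<<<blocks, threads>>>(d_out);
    bool ok = launch_ok("dummy_kernel");
    cudaFree(d_out);
    return ok;
}
```

Calling the same kind of check after cal_sum_vfreq_pl<<<M,512>>>( ... ) would show whether the kernel ran at all; if it did not, freq simply keeps whatever values it held before, which would look like an incorrect result.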
I currently only have access to the compute capability 1.0 architecture: an Nvidia Quadro FX4600 graphics card in a 64-bit Windows machine.
I have no idea why this happens. I am positive that the first function runs with 512 threads per block, while the second only runs with 384 (or fewer).
Can someone please suggest a possible solution?
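One possible explanation, which is an assumption and not something the results above confirm, is register pressure. A compute capability 1.0 multiprocessor has 8192 registers, so a 512-thread block can use at most 16 registers per thread, while a 384-thread block can use up to 21. A kernel that needs between 17 and 21 registers would therefore launch with 384 threads but not with 512. The sketch below compares what a kernel needs with what the device offers; placeholder_kernel, max_regs_per_thread and kernel_fits are hypothetical names, and the arithmetic ignores register allocation granularity.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder; substitute the kernel under investigation.
__global__ void placeholder_kernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

// Largest per-thread register count that still fits one block of the given
// size into the register file (allocation granularity is ignored).
int max_regs_per_thread(int regs_per_block, int threads_per_block)
{
    assert(threads_per_block > 0);
    return regs_per_block / threads_per_block;
}

// Prints the kernel's register use next to the device limits.
// Returns true if a block of threads_per_block threads should fit.
bool kernel_fits(int threads_per_block)
{
    cudaDeviceProp prop;
    cudaFuncAttributes attr;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        return false;
    }
    if (cudaFuncGetAttributes(&attr, (const void *)placeholder_kernel) != cudaSuccess) {
        return false;
    }
    int budget = max_regs_per_thread(prop.regsPerBlock, threads_per_block);
    printf("registers per thread: %d (budget %d at %d threads)\n",
           attr.numRegs, budget, threads_per_block);
    // maxThreadsPerBlock here is specific to this kernel on this device.
    printf("largest block this kernel can launch with: %d threads\n",
           attr.maxThreadsPerBlock);
    return attr.numRegs <= budget && threads_per_block <= attr.maxThreadsPerBlock;
}
```

Compiling with nvcc -Xptxas -v prints the per-kernel register count at build time, which gives the same number without running anything.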
Thanks in advance...
EDIT: Here is the kernel code:
__global__ void cal_sum_vfreq_pl(float *freq, float *v_freq_vectors, char *wstrings, char *vstrings, int *k){
    int index = threadIdx.x;      // one w-string per thread
    int m = blockIdx.x;           // one frequency vector per block
    int block_dim = blockDim.x;
    int kv = *k;
    int vv = kv-1;                // length of a v-string
    int wv = kv-2;                // length of a w-string
    int woffset = index*wv;
    int no_vstrings = pow_pl(4, vv);
    float temppp=0;
    char wI[20], Iw[20];          // w with a nucleotide appended / prepended
    int Iwi, wIi;
    // copy this thread's w-string into both buffers
    for(int i=0;i<wv;i++) Iw[i+1] = wI[i] = wstrings[woffset + i];
    for(int l=0;l<4;l++){
        Iw[0] = get_nucleotide_pl(l);
        wI[vv-1] = get_nucleotide_pl(l);
        Iwi = binary_search_pl(vstrings, Iw, vv);
        wIi = binary_search_pl(vstrings, wI, vv);
        temppp = temppp + v_freq_vectors[m*no_vstrings + Iwi] + v_freq_vectors[m*no_vstrings + wIi];
    }
    freq[index + m*block_dim] = 0.5*temppp;
}