Any time you are having trouble with a CUDA code, it's good practice to use proper CUDA error checking and to run your code with cuda-memcheck. If you had run this code with cuda-memcheck, it would have reported errors that might be instructive.
It's not clear why you would ask about 17 threads. In CUDA, the number of threads launched can be deduced from the kernel launch; in brief, it is the product of the first two numbers in the kernel launch configuration (<<<...>>>):
pattern_search<<<1,50>>>(d_txt,d_pat,d_result,N,M);
so in this case, it should launch 50 threads. Even if your question is restricted to N, the number N for your posted code is 18, not 17.
cudaMalloc, like host-side malloc, allocates memory in bytes. Therefore usage like this is not correct for this case:
cudaMalloc((void**)&d_result,N);
and instead you should do something like this:
cudaMalloc((void**)&d_result,N*sizeof(int));
since in this case you want to store N quantities of size int. This error was discoverable with cuda-memcheck, which would report invalid __global__ writes as a result of this allocation error.
A similar problem occurs on your cudaMemcpy calls, which also operate on bytes (just like host memcpy). Instead of this:
cudaMemcpy(d_result,result,N,cudaMemcpyHostToDevice);
we want this:
cudaMemcpy(d_result,result,N*sizeof(int),cudaMemcpyHostToDevice);
and a similar correction needs to be made on the call after the kernel.
Your kernel has an out-of-bounds indexing error:
if(id<=(N)){
    int j=0;
    for(j=0;j<M;j++){
        if(d_txt[id+j]!=d_pat[j]){
the above code will allow the for-loop to index beyond the end of the array d_txt, which is limited to length N. To fix this, we can restrict the loop to operate only when there is enough "indexing room" for the full iteration of the j-loop:
if((id+M)<=(N)){
    int j=0;
    for(j=0;j<M;j++){
        if(d_txt[id+j]!=d_pat[j]){
(and there are probably many other ways to address this as well.) This error could have been spotted by the invalid __global__ reads reported by cuda-memcheck.
The following code has the above items addressed, and runs without error for me:
$ cat t964.cu
#include <stdio.h>
#include <string.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void pattern_search(char* d_txt, char* d_pat, int* d_result, int N, int M){
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if ((id + M) <= N){
        int j = 0;
        for (j = 0; j < M; j++){
            if (d_txt[id + j] != d_pat[j]){
                break;
            }
        }
        //if (j == M){
            d_result[id] = id;
        //}
    }
}

int main(){
    char txt[] = "AABAACAADAABAAABAA";
    char pat[] = "AABA";
    int N = strlen(txt);
    int M = strlen(pat);
    char* d_pat;
    cudaMalloc((void **)&d_pat, M);
    char* d_txt;
    cudaMalloc((void **)&d_txt, N);
    int result[N];
    for (int i = 0; i < N; i++){
        result[i] = 0;
    }
    int* d_result;
    cudaMalloc((void **)&d_result, N * sizeof(int));
    cudaMemcpy(d_txt, txt, N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_pat, pat, M, cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, result, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaCheckErrors("1");
    pattern_search<<<1,50>>>(d_txt, d_pat, d_result, N, M);
    cudaMemcpy(result, d_result, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaCheckErrors("2");
    for (int k = 0; k < N; k++){
        printf("pattern found at:%d\n", result[k]);
    }
}
You haven't indicated exactly what output you expect, but the results seem plausible to me.
Note that my applied fix for the kernel indexing error above means that only the first N-M+1 threads will report results. If you wanted some sort of different behavior (I'm not sure exactly what sort of pattern matching you want), there would be other ways to modify this, of course.
In the future, if you want to avoid down-votes and close-votes, my suggestion would be to note that your question is essentially asking for debugging assistance ("Why isn't this code working?") and so it falls into a category of questions for which SO expects an MCVE. You provided a complete code, which is good. There are probably a few other things that could be improved:
State more clearly what the problem is. The question about how many threads are running is not very clear. For a good MCVE, you should explain what the expected results are and show the actual results. In some cases it may also be useful to mention the CUDA version, the compile command line, and the platform (host operating system) you are using.
Demonstrate proper CUDA error checking and use of cuda-memcheck. Even if you don't understand the error output, describe or include it in your question; it will be useful to others who are trying to help you.