
I am using the GPU to do some computation on words. Initially, I used one block (with 500 threads) to process one word. To process 100 words, I have to launch the kernel 100 times in a loop in my main function:

for (int i=0; i<100; i++)
    kernel <<< 1, 500 >>> (length_of_word); 

My kernel function looks like this:

__global__ void kernel (int *dev_length)
{
    int length = *dev_length;
    while (length > 4)
    {
        // do something
        length -= 4;
    }
}

Now I want to process all 100 words at the same time.

Each block still has 500 threads, and each block processes one word.

dev_totalwordarray: stores all characters of the words (one after another)

dev_length_array: stores the length of each word

dev_accu_length: stores the accumulated length before each word (the total character count of all previous words)

dev_salt_: an array of size 500, storing unsigned integers

Hence, in my main function I have

   kernel2 <<< 100, 500 >>> (dev_totalwordarray, dev_length_array, dev_accu_length, dev_salt_);

To populate the CPU array of word lengths:

    for (int i=0; i<wordnumber; i++)
    {
        int length=0;
        while (word_list_ptr_array[i][length]!=0)
        {
            length++;
        }

        actualwordlength2[i] = length;
    }

To copy it from CPU to GPU:

    int* dev_array_of_word_length;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_array_of_word_length, 100 * sizeof(int) ) );
    HANDLE_ERROR( cudaMemcpy( dev_array_of_word_length, actualwordlength2, 100 * sizeof(int),
                              cudaMemcpyHostToDevice ) );
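
For reference, here is a minimal sketch (not my exact code) of how dev_totalwordarray and dev_accu_length could be built and copied, assuming word_list_ptr_array holds 100 null-terminated strings and reusing the lengths already stored in actualwordlength2:

    // Host side (needs <cstdlib> and <cstring> for malloc/memcpy):
    // concatenate all words and record where each word starts.
    int accu_length[100];
    int total_chars = 0;
    for (int i = 0; i < 100; i++)
    {
        accu_length[i] = total_chars;          // offset of word i in the concatenated array
        total_chars += actualwordlength2[i];
    }

    char* totalwordarray = (char*)malloc(total_chars);
    for (int i = 0; i < 100; i++)
        memcpy(totalwordarray + accu_length[i], word_list_ptr_array[i], actualwordlength2[i]);

    // Device side, mirroring the length-array copy above.
    char* dev_totalwordarray;
    int*  dev_accu_length;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_totalwordarray, total_chars * sizeof(char) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_accu_length, 100 * sizeof(int) ) );
    HANDLE_ERROR( cudaMemcpy( dev_totalwordarray, totalwordarray, total_chars * sizeof(char),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_accu_length, accu_length, 100 * sizeof(int),
                              cudaMemcpyHostToDevice ) );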

My function kernel now looks like this:

__global__ void kernel2 (char* dev_totalwordarray, int *dev_length_array, int* dev_accu_length, unsigned int* dev_salt_)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int hash[N];

    int length = dev_length_array[blockIdx.x];

    while (tid < 50000)
    {
        const char* itr = &(dev_totalwordarray[dev_accu_length[blockIdx.x]]);
        hash[tid] = dev_salt_[threadIdx.x];
        unsigned int loop = 0;

        while (length > 4)
        {
            const unsigned int& i1 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
            const unsigned int& i2 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
            hash[tid] ^= (hash[tid] <<  7) ^  i1 * (hash[tid] >> 3) ^ (~((hash[tid] << 11) + (i2 ^ (hash[tid] >> 5))));
            length -= 4;
        }
        tid += blockDim.x * gridDim.x;
    }
}

However, kernel2 doesn't seem to work at all.

The while (length > 4) loop seems to be the cause.

Does anyone know why? Thanks.

user1807890

1 Answer


I am not sure if the while is the culprit, but I see a few things in your code that worry me:

  • Your kernel produces no output. The optimizer will most likely detect this and reduce it to an empty kernel (see the sketch after this list for writing the results to global memory).
  • In almost no situation do you want arrays allocated per thread. That consumes a lot of memory. Your hash[N] table will be allocated per thread and discarded at the end of the kernel. If N is big (and then multiplied by the total number of threads) you may run out of GPU memory. Not to mention that accessing hash will be almost as slow as accessing global memory.
  • All threads in a block will have the same itr value. Is that intended?
  • Every thread initializes only a single field within its own copy of the hash table.
  • I see hash[tid] where tid is a global index. Be aware that even if hash were made global, you could hit concurrency problems. Not all blocks within a grid run at the same time. While one block is initializing a portion of hash, another block might not even have started!
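
To make this concrete, here is a rough sketch (based on my assumptions about what you want: one word per block, one hash value per thread) in which each thread keeps its hash in a register and writes it to a global output buffer. dev_hash_out is a hypothetical device array of size gridDim.x * blockDim.x (100 * 500 here) that you would allocate with cudaMalloc:

__global__ void kernel2_sketch (const char* dev_totalwordarray, const int* dev_length_array,
                                const int* dev_accu_length, const unsigned int* dev_salt_,
                                unsigned int* dev_hash_out)  // hypothetical output buffer
{
    // One word per block, one hash value per thread.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    int length = dev_length_array[blockIdx.x];
    // Assumes the word data is readable as 4-byte words, as in your original code.
    const char* itr = dev_totalwordarray + dev_accu_length[blockIdx.x];

    unsigned int h = dev_salt_[threadIdx.x];   // per-thread register instead of a per-thread array

    while (length > 4)
    {
        const unsigned int i1 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
        const unsigned int i2 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
        h ^= (h << 7) ^ i1 * (h >> 3) ^ (~((h << 11) + (i2 ^ (h >> 5))));
        length -= 4;
    }

    // Writing to global memory gives the kernel visible output,
    // so the optimizer cannot remove the computation.
    dev_hash_out[tid] = h;
}

On the host, dev_hash_out would be a cudaMalloc'ed buffer of 100 * 500 unsigned ints that you copy back with cudaMemcpy after the kernel finishes.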
CygnusX1
  • actually, my kernel function receives more arguments than what I have posted. It will receive another pointer to a hashtable. So what I actually want my kernel to do is: 1. each thread calculates one hash value. 2. each block will calculate for one word, i.e. one word will have 500 hash values. 3. the hashtable (of size 50000) will be updated with all the values, then I will copy the updated hashtable back to the CPU. Could you suggest how I can modify the code? Thanks. – user1807890 Nov 08 '12 at 08:54
  • I just realized that every thread is modifying only one cell: `hash[tid]` - is this the intended behavior? Also - all threads from the same block will read exactly the same thing from `dev_totalwordarray`. So, the output in `hash[tid]` will differ only by the content of `dev_salt_`; every thread will process whole word. – CygnusX1 Nov 08 '12 at 10:31
  • yes, you are right. Each thread will process the entire word, and threads from the same block will process the same word. – user1807890 Nov 08 '12 at 11:45
  • Then if this is what you intend - it should work. What hides behind "does not work at all"? Fail to launch? Wrong output? .... – CygnusX1 Nov 08 '12 at 15:45
  • there is no error when I compile, but it seems the kernel isn't being executed. I tried putting a printf inside while (tid < 50000), and nothing is printed (unless while (length > 4) is commented out). That's why I thought (length > 4) is the main cause of the problem. – user1807890 Nov 09 '12 at 01:15
  • Did you check the error code returned from the kernel? Call `cudaDeviceSynchronize` after the kernel call and check the error code (a minimal sketch follows these comments). Check the manual (http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDART__DEVICE_gb76422145b5425829597ebd1003303fe.html#gb76422145b5425829597ebd1003303fe) – CygnusX1 Nov 09 '12 at 08:07
  • cudaDeviceSynchronize returns a 0. This means there is no error, is this right? – user1807890 Nov 12 '12 at 03:42
  • It is the first CUDA API call after the kernel right? Then it means no error. – CygnusX1 Nov 12 '12 at 07:50
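
Following up on the error-checking suggestion above, a minimal sketch (using only standard CUDA runtime calls; printf needs <cstdio>) of how the launch could be verified:

    kernel2 <<< 100, 500 >>> (dev_totalwordarray, dev_length_array, dev_accu_length, dev_salt_);

    // Catch launch-configuration errors (invalid grid/block size, bad arguments, ...).
    cudaError_t launch_err = cudaGetLastError();
    if (launch_err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(launch_err));

    // Wait for the kernel to finish and catch execution errors
    // (out-of-bounds accesses, misaligned reads, too many resources requested, ...).
    cudaError_t sync_err = cudaDeviceSynchronize();
    if (sync_err != cudaSuccess)
        printf("kernel execution failed: %s\n", cudaGetErrorString(sync_err));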