2

My application do some stuff in device-code and generates an array inside the kernel.

I need to search the first occurrence of an element in this array. How can i perform it in GPU? If i copy the array to CPU and do the work there, it will generate so much memory traffic, because this piece of code is called many times.

Blufter
  • 97
  • 1
  • 12

2 Answers2

2

There is most probably a more sophisticated solution, but for a start and especially if the number of occurrences of the element is very small, a simple brute-force atomic-min might be a viable solution:

template<typename T> __global__ void find(T *data, T value, int *min_idx)
{
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    if(data[idx] == value)
        atomicMin(min_idx, idx);
}

If the number of occurrences is really small and thus nearly all threads don't even attempt to access the atomic, this might actually be not that bad a solution. Otherwise (if the searched element is not so rare) you would have much more intra-warp divergence and, even worse, a much higher probability of conflicting atomic operations.


EDIT: For a more sophisticated approach (but maybe still not the best) you could otherwise also in a pre-step create an int array with the value at index idx set to idx if the input array's element equals the searched element at that index, and INT_MAX if it doesn't:

indices[idx] = (data[idx]==value) ? idx : INT_MAX;

and then do a "classical" minimum-reduction on that index array to get the first matching index.

Christian Rau
  • 45,360
  • 10
  • 108
  • 185
  • I'll wait to see if someone else come with new ideas, otherwise i'll choose your answer as the best one. Thank you. – Blufter Jun 11 '13 at 15:20
  • Just a note: if your array is much bigger than the number of threads you have, you can perform just a step-wise reduction (i.e. load just one value per thread, do a reduction between the threads and only if the value is not found - iterate) – CygnusX1 Jun 11 '13 at 19:14
0

One approach is to use atomic operations which prevent other threads from accessing editable data until the one currently processing it is done.

Here's an example of finding first occurrence of a word: http://supercomputingblog.com/cuda/search-algorithm-with-cuda/ The atomicMin function is used in that example. In addition, there's also a performance comparison between GPU and CPU in the article.

Another way to find first occurrence is to use a method known as parallel reduction. There is an example of parallel sum in the CUDA SDK (the sample calculates sum of all values in an array). Parallel reduction is a good option especially if you use hardware with older compute capability version and if you need high precision.

To use parallel reduction to find first occurrence, you firstly check if the value in the array equals to what you want to find. If it does, you save its index. Then, you perform one or many min operations (not atomic min) where you compare the indices you saved in the previous step. You can implement this search by editing the parallel sum example of CUDA SDK.

This site has some information about reduction and atomic operations. It also includes binary tree reduction and workaround atomic functions that I haven't talked about here.

The atomic vs. reduction issue has also been discussed on Stack Overflow.

Community
  • 1
  • 1
user2448027
  • 1,628
  • 10
  • 11
  • Sorry, can explain better how to transform the parallel reduction into this case and you tell me what are these `min` and `max` functions? Please, can you link them from the documentation? Thank you :) – Gianluca Sep 16 '20 at 18:53