There is most probably a more sophisticated solution, but as a start, and especially if the element occurs only a few times, a simple brute-force atomicMin might be a viable solution:
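    // n is the number of elements in data; *min_idx must hold INT_MAX before the launch.
    template<typename T> __global__ void find(const T *data, int n, T value, int *min_idx)
    {
        int idx = threadIdx.x + blockDim.x*blockIdx.x;
        // Guard against threads past the end of the array in the last block.
        if(idx < n && data[idx] == value)
            atomicMin(min_idx, idx);
    }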
If the number of occurrences is really small, so that nearly all threads never even touch the atomic, this might actually not be that bad a solution. Otherwise (if the searched element is not so rare) you would get much more intra-warp divergence and, even worse, a much higher probability of conflicting atomic operations.
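For completeness, here is a minimal host-side sketch of how the atomic-min approach could be driven. The names find_first_kernel and find_first, the explicit length parameter n, the block size of 256 and the -1 "not found" return value are my own assumptions for illustration, and error checking of the CUDA calls is left out for brevity:

```cuda
#include <climits>
#include <cuda_runtime.h>

// Brute-force search: every thread whose element matches tries to lower *min_idx.
// n is an assumed extra parameter so the last block does not read out of bounds.
template<typename T>
__global__ void find_first_kernel(const T *data, int n, T value, int *min_idx)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n && data[idx] == value)
        atomicMin(min_idx, idx);
}

// Hypothetical helper: copies h_data to the device, runs the kernel and
// returns the first matching index, or -1 if the value does not occur.
template<typename T>
int find_first(const T *h_data, int n, T value)
{
    if (n <= 0)
        return -1;

    T *d_data = nullptr;
    int *d_min = nullptr;
    int h_min = INT_MAX;  // sentinel: larger than any valid index

    cudaMalloc(&d_data, n * sizeof(T));
    cudaMalloc(&d_min, sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(T), cudaMemcpyHostToDevice);
    cudaMemcpy(d_min, &h_min, sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    find_first_kernel<<<blocks, threads>>>(d_data, n, value, d_min);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_min, d_min, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_min);
    return (h_min == INT_MAX) ? -1 : h_min;
}
```

The important detail is that the result has to start at a value larger than any valid index (INT_MAX here), since atomicMin can only ever lower it.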
EDIT: For a more sophisticated approach (but maybe still not the best), you could instead, in a pre-step, create an int array in which the value at index idx is set to idx if the input array's element at that index equals the searched element, and to INT_MAX if it doesn't:
    indices[idx] = (data[idx]==value) ? idx : INT_MAX;
and then do a "classical" minimum-reduction on that index array to get the first matching index.
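A rough sketch of how this two-step variant might look, assuming a plain shared-memory tree reduction that is applied repeatedly until one value is left. The names mark_matches, min_reduce and find_first_reduce, the length parameter n, the block size of 256 and the -1 "not found" return value are assumptions of mine, error checking is omitted, and a tuned library reduction would most likely be faster:

```cuda
#include <climits>
#include <utility>
#include <cuda_runtime.h>

// Step 1: indices[idx] = idx where the element matches, INT_MAX elsewhere.
template<typename T>
__global__ void mark_matches(const T *data, int n, T value, int *indices)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n)
        indices[idx] = (data[idx] == value) ? idx : INT_MAX;
}

// Step 2: one pass of a min-reduction; each block writes one partial minimum.
// Assumes blockDim.x is a power of two.
__global__ void min_reduce(const int *in, int n, int *out)
{
    extern __shared__ int s[];
    int tid = threadIdx.x;
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    s[tid] = (idx < n) ? in[idx] : INT_MAX;  // pad the last block with the sentinel
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] = min(s[tid], s[tid + stride]);
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];
}

// Hypothetical helper: returns the first matching index, or -1 if there is none.
template<typename T>
int find_first_reduce(const T *h_data, int n, T value)
{
    if (n <= 0)
        return -1;
    const int threads = 256;  // assumed block size, power of two
    int blocks = (n + threads - 1) / threads;

    T *d_data = nullptr;
    int *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_data, n * sizeof(T));
    cudaMalloc(&d_a, n * sizeof(int));       // index array
    cudaMalloc(&d_b, blocks * sizeof(int));  // partial minima
    cudaMemcpy(d_data, h_data, n * sizeof(T), cudaMemcpyHostToDevice);

    mark_matches<<<blocks, threads>>>(d_data, n, value, d_a);

    // Reduce repeatedly until a single value remains in d_a[0].
    int count = n;
    while (count > 1) {
        blocks = (count + threads - 1) / threads;
        min_reduce<<<blocks, threads, threads * sizeof(int)>>>(d_a, count, d_b);
        std::swap(d_a, d_b);  // the output of this pass is the input of the next
        count = blocks;
    }

    int result = INT_MAX;
    cudaMemcpy(&result, d_a, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_a);
    cudaFree(d_b);
    return (result == INT_MAX) ? -1 : result;
}
```

Unlike the atomic version, the cost here does not depend on how often the value occurs, at the price of an extra int array of size n and a few more kernel launches.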