I'm trying to construct a parallel algorithm with CUDA that takes an array of integers and removes all of the 0
's with or without keeping the order.
Example:
Global Memory: {0, 0, 0, 0, 14, 0, 0, 17, 0, 0, 0, 0, 13}
Host Memory Result: {17, 13, 14, 0, 0, ...}
The simplest way is to use the host to remove the 0
's in O(n)
time. But considering I have around 1000
elements, it probably will be faster to leave everything on the GPU and condense it first, before sending it.
The preferred method would be to create an on-device stack, such that each thread can pop and push (in any order) onto or off of the stack. However, I don't think CUDA has an implementation of this.
An equivalent (but much slower) method would be to keep attempting to write, until all threads have finished writing:
kernalRemoveSpacing(int * array, int * outArray, int arraySize) {
if (array[threadId.x] == 0)
return;
for (int i = 0; i < arraySize; i++) {
array = arr[threadId.x];
__threadfence();
// If we were the lucky thread we won!
// kill the thread and continue re-reincarnated in a different thread
if (array[i] == arr[threadId.x])
return;
}
}
This method has only benefit in that we would perform in O(f(x))
time, where f(x)
is the average number of non-zero values there are in an array (f(x) ~= ln(n)
for my implementation, thus O(ln(n))
time, but has a high O
constant)
Finally, a sort algorithm such as quicksort or mergesort would also solve the problem, and does in fact run in O(ln(n))
relative time. I think there might be an algorithm faster than this even, as we do not need to waste time ordering (swapping) zero-zero element pairs, and non-zero non-zero element pairs (the order does not need to be kept).
So I'm not quite sure which method would be the fastest, and I still think there's a better way of handling this. Any suggestions?