CUDA stream compaction algorithm

Question

I'm trying to construct a parallel algorithm with CUDA that takes an array of integers and removes all of the 0's with or without keeping the order.

Example:

Global Memory: {0, 0, 0, 0, 14, 0, 0, 17, 0, 0, 0, 0, 13}

Host Memory Result: {17, 13, 14, 0, 0, ...}

The simplest way is to use the host to remove the 0's in O(n) time. But considering I have around 1000 elements, it probably will be faster to leave everything on the GPU and condense it first, before sending it.

The preferred method would be to create an on-device stack, such that each thread can pop and push (in any order) onto or off of the stack. However, I don't think CUDA has an implementation of this.

An equivalent (but much slower) method would be to keep attempting to write, until all threads have finished writing:

__global__ void kernelRemoveSpacing(const int *array, int *outArray, int arraySize) {
    int value = array[threadIdx.x];
    if (value == 0)
        return;

    for (int i = 0; i < arraySize; i++) {

         // Attempt to claim output slot i by writing our value into it
         outArray[i] = value;

         __threadfence();

         // If we were the lucky thread we won!
         // kill the thread and continue reincarnated in a different thread
         if (outArray[i] == value)
             return;
    }
}

The only benefit of this method is that it would run in O(f(x)) time, where f(x) is the average number of non-zero values in an array (f(x) ~= ln(n) for my implementation, thus O(ln(n)) time, but with a high constant factor).

Finally, a sort algorithm such as quicksort or mergesort would also solve the problem, and does in fact run in O(ln(n)) relative time. I think there might even be a faster algorithm than this, as we do not need to waste time ordering (swapping) zero/zero element pairs and non-zero/non-zero element pairs (the order does not need to be kept).

So I'm not quite sure which method would be the fastest, and I still think there's a better way of handling this. Any suggestions?

The algorithm is called stream compaction, and it is a solved problem with good theoretical analyses and several very high-performance off-the-shelf implementations available via the search engine of your choice. – talonmies, Dec 03 '15 at 07:25
For the record, this operation is also called "left packing", as in [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) (for x86 using SIMD on the CPU, not GPU). — Peter Cordes, May 11 '21 at 18:11

score 13 · Accepted Answer · edited Jun 14 '21 at 02:20

What you are asking for is a classic parallel algorithm called stream compaction¹.

If Thrust is an option, you may simply use thrust::copy_if. This is a stable algorithm: it preserves the relative order of all elements.

Rough sketch:

#include <thrust/copy.h>

template<typename T>
struct is_non_zero {
    __host__ __device__
    auto operator()(T x) const -> bool {
        return x != 0;
    }
};

// ... your input and output vectors here

thrust::copy_if(input.begin(), input.end(), output.begin(), is_non_zero<int>());
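A fuller version of the sketch above might look like the following. It is only an illustration under a few assumptions: the helper name `removeZeros` and the functor name `IsNonZero` are made up for this example, the input lives in a host `std::vector<int>`, and the output is sized to the worst case (no zeros at all). The point it shows is that `thrust::copy_if` returns an iterator to the end of the compacted range, which tells you how many elements survived.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Hypothetical predicate for this example: keep every non-zero element.
struct IsNonZero {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Hypothetical helper: copies the input to the GPU, compacts it there,
// and copies only the surviving elements back to the host.
std::vector<int> removeZeros(const std::vector<int> &h_in) {
    thrust::device_vector<int> input(h_in.begin(), h_in.end());
    thrust::device_vector<int> output(input.size());  // worst case: nothing removed

    // copy_if is stable and returns the end of the compacted range.
    auto last = thrust::copy_if(input.begin(), input.end(),
                                output.begin(), IsNonZero());

    std::vector<int> h_out(last - output.begin());
    thrust::copy(output.begin(), last, h_out.begin());  // device -> host
    return h_out;
}
```

With the array from the question, `removeZeros({0, 0, 0, 0, 14, 0, 0, 17, 0, 0, 0, 0, 13})` should return `{14, 17, 13}`, and only those three values cross the bus back to the host.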

If Thrust is not an option, you may implement stream compaction yourself (there is plenty of literature on the topic). It's a fun and reasonably simple exercise, while also being a basic building block for more complex parallel primitives.

⁽¹⁾ Strictly speaking, it's not exactly stream compaction in the traditional sense, as stream compaction is traditionally a stable algorithm but your requirements do not include stability. This relaxed requirement could perhaps lead to a more efficient implementation?
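If order really does not matter, one simple way to exploit that is a single global counter bumped with `atomicAdd`, which behaves like the push-only on-device stack the question describes. The following is a minimal sketch, not a tuned implementation: the names `compactUnordered` and `removeZerosUnordered` are invented for this example, error checking is omitted, and whether it beats a scan-based approach is not something established here (every surviving element contends on the same counter).

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread holding a non-zero element reserves a unique output slot by
// atomically incrementing a global counter. Output order is unspecified.
__global__ void compactUnordered(const int *in, int *out, int *count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] != 0) {
        int slot = atomicAdd(count, 1);  // returns the old value: our slot
        out[slot] = in[i];
    }
}

// Hypothetical host wrapper (no error checking, for brevity).
std::vector<int> removeZerosUnordered(const std::vector<int> &h_in) {
    const int n = static_cast<int>(h_in.size());
    if (n == 0) return {};

    int *d_in = nullptr, *d_out = nullptr, *d_count = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    const int blockSize = 256;
    const int numBlocks = (n + blockSize - 1) / blockSize;
    compactUnordered<<<numBlocks, blockSize>>>(d_in, d_out, d_count, n);

    // Read back the number of survivors first, then only that many elements.
    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> h_out(count);
    if (count > 0)
        cudaMemcpy(h_out.data(), d_out, count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_count);
    return h_out;
}
```

For the question's array this returns 13, 14 and 17 in some order. A common refinement is to aggregate per warp or per block first, so that only one atomic per group touches the global counter.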

edited Jun 14 '21 at 02:20 · KansaiRobot

answered Dec 03 '15 at 07:28

Referring to (1), Yeah, I'm curious as well. I like the idea of counting the number of non-zero elements, and then creating a specific map function for the total number of non-zero elements. Typically this map function is the prefix sum, but considering we don't need to order it, is there a function that is calculated faster? – Dane Bouchie Dec 03 '15 at 07:54
I guess it's akin to a "partial" stream compaction, where order is preserved locally across a warp (or block, depending on your implementation), but not globally. In other words, this means the prefix sum is also partial. Yes, it would be faster in the sense that the algorithm finishes earlier (there are fewer operations to do, since a global order is not required). Whether this would lead to *actual* better performance is hard to say, but it's a reasonable expectation that it would. – Dec 03 '15 at 08:47
Regarding literature, [here is a July 2017 article](https://www.jstage.jst.go.jp/article/ijnc/7/2/7_208/_pdf/-char/ja), with source code, that reports faster than thrust::copy_if. – Tyson Hilmer Jun 18 '18 at 11:38

score 7 · Answer 2 · Davide Spataro · edited Apr 12 '22 at 07:09

Stream compaction is a well-known problem for which a lot of code has been written (Thrust and Chagg, to cite two libraries that implement stream compaction on CUDA).

If you have a relatively new CUDA-capable device that supports intrinsic functions such as __ballot (compute capability >= 3.0), it is worth trying a small CUDA procedure that performs stream compaction much faster than Thrust.

The code and minimal documentation are available here: https://github.com/knotman90/cuStreamComp

It uses a balloting function in a single-kernel fashion to perform the compaction.

Edit:

I wrote an article explaining the inner workings of this approach. You can find it here if you are interested.

edited Apr 12 '22 at 07:09 · answered Jan 06 '16 at 17:55 · Davide Spataro

would this work well on a Jetson device? How about a NVIDIA Drive machine? – KansaiRobot Jun 14 '21 at 00:52
Also, the code has no license that I can see. Can this be used in other people's code? – KansaiRobot Jun 14 '21 at 10:55
@KansaiRobot have not tried it on Jetson devices and it is released with LGPL-3.0 license. – Davide Spataro Jun 15 '21 at 06:49

score 5 · Answer 3 · answered Feb 07 '17 at 08:22

With this answer, I'm only trying to provide more details to Davide Spataro's approach.

As you mentioned, stream compaction consists of removing undesired elements in a collection depending on a predicate. For example, considering an array of integers and the predicate p(x)=x>5, the array A={6,3,2,11,4,5,3,7,5,77,94,0} is compacted to B={6,11,7,77,94}.

The general idea of stream compaction approaches is that a different computational thread is assigned to each element of the array to be compacted. Each such thread must decide whether to write its corresponding element to the output array, depending on whether it satisfies the relevant predicate. The main problem of stream compaction is thus letting each thread know at which position in the output array its element must be written.

The approach in [1,2] is an alternative to Thrust's copy_if mentioned above and consists of three steps:

Step #1. Let P be the number of launched threads and N, with N>P, the size of the vector to be compacted. The input vector is divided into sub-vectors of size S equal to the block size. The __syncthreads_count(pred) block intrinsic, which counts the number of threads in a block satisfying the predicate pred, is exploited. As a result of the first step, each element of the array d_BlockCounts, which has size N/P, contains the number of elements meeting the predicate pred in the corresponding block.
Step #2. An exclusive scan operation is performed on the array d_BlockCounts. As a result of the second step, each thread knows how many threads in the previous blocks write an element. Accordingly, it knows the position where to write its corresponding element, up to an offset related to its own block.
Step #3. Each thread computes the mentioned offset using warp intrinsic functions and eventually writes to the output array. It should be noted that the execution of step #3 is related to warp scheduling. As a consequence, the order of elements in the output array does not necessarily reflect the order of elements in the input array.

Of the three steps above, the second is performed by CUDA Thrust’s exclusive_scan primitive and is computationally significantly less demanding than the other two.
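The three steps above can be sketched roughly as follows. This is a simplified illustration rather than the code from [1,2] or from either linked repository. It assumes one thread per element (so the per-block count array has one entry per launched block), a block size that is a multiple of 32, the example predicate p(x)=x>5, and the `_sync` form of the ballot intrinsic with a full-warp mask. The names `keep`, `countPerBlock`, `scatter` and `compact` are made up for this example, and error checking is omitted.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Example predicate (assumption): keep elements greater than 5.
__device__ bool keep(int x) { return x > 5; }

// Step 1: count, per block, how many elements satisfy the predicate.
__global__ void countPerBlock(const int *in, int *blockCounts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && keep(in[i]);
    int count = __syncthreads_count(pred);  // must be reached by every thread
    if (threadIdx.x == 0) blockCounts[blockIdx.x] = count;
}

// Step 3: each thread derives its offset inside the block and writes.
__global__ void scatter(const int *in, int *out, const int *blockOffsets, int n) {
    __shared__ int warpCounts[32];  // at most 32 warps per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    int value = (i < n) ? in[i] : 0;
    bool pred = (i < n) && keep(value);

    unsigned ballot = __ballot_sync(0xFFFFFFFFu, pred);
    int rankInWarp = __popc(ballot & ((1u << lane) - 1u));  // survivors in lower lanes
    if (lane == 0) warpCounts[warp] = __popc(ballot);
    __syncthreads();

    if (pred) {
        int warpOffset = 0;  // short serial sum over the preceding warps
        for (int w = 0; w < warp; ++w) warpOffset += warpCounts[w];
        out[blockOffsets[blockIdx.x] + warpOffset + rankInWarp] = value;
    }
}

std::vector<int> compact(const std::vector<int> &h_in) {
    const int n = static_cast<int>(h_in.size());
    if (n == 0) return {};
    const int blockSize = 256;  // multiple of 32, so every warp is full
    const int numBlocks = (n + blockSize - 1) / blockSize;

    thrust::device_vector<int> d_in(h_in.begin(), h_in.end());
    thrust::device_vector<int> d_counts(numBlocks), d_offsets(numBlocks), d_out(n);

    countPerBlock<<<numBlocks, blockSize>>>(
        thrust::raw_pointer_cast(d_in.data()),
        thrust::raw_pointer_cast(d_counts.data()), n);

    // Step 2: exclusive scan turns per-block counts into per-block offsets.
    thrust::exclusive_scan(d_counts.begin(), d_counts.end(), d_offsets.begin());

    scatter<<<numBlocks, blockSize>>>(
        thrust::raw_pointer_cast(d_in.data()),
        thrust::raw_pointer_cast(d_out.data()),
        thrust::raw_pointer_cast(d_offsets.data()), n);

    int lastOffset = d_offsets.back();
    int lastCount = d_counts.back();
    const int total = lastOffset + lastCount;

    std::vector<int> h_out(total);
    thrust::copy(d_out.begin(), d_out.begin() + total, h_out.begin());
    return h_out;
}
```

Because this sketch assigns offsets strictly by block, warp and lane index, it happens to preserve the input order: for A={6,3,2,11,4,5,3,7,5,77,94,0} it yields {6,11,7,77,94}. Implementations that compute the in-block offset differently need not have that property. The predicate is also evaluated twice per element, once in each kernel.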

For an array of 2097152 elements, the mentioned approach executed in 0.38 ms on an NVIDIA GTX 960 card, in contrast to 1.0 ms for CUDA Thrust’s copy_if. The mentioned approach appears to be faster for two reasons: 1) it is specifically tailored to cards supporting warp intrinsics; 2) it does not guarantee the output ordering.

It should be noted that we have also tested the approach against the code available at inkc.sourceforge.net. Although the latter code is arranged in a single kernel call (it does not employ any CUDA Thrust primitive), it does not perform better than the three-kernel version.

The full code is available here and is slightly optimized compared to Davide Spataro's original routine.

[1] M. Billeter, O. Olsson, U. Assarsson, “Efficient stream compaction on wide SIMD many-core architectures,” Proc. of the Conf. on High Performance Graphics, New Orleans, LA, Aug. 01 - 03, 2009, pp. 159-166.
[2] D.M. Hughes, I.S. Lim, M.W. Jones, A. Knoll, B. Spencer, “InK-Compact: in-kernel stream compaction and its application to multi-kernel data visualization on General-Purpose GPUs,” Computer Graphics Forum, vol. 32, n. 6, pp. 178-188, 2013.

Circa CUDA Toolkit 9 and the Volta architecture, the warp intrinsics used in both JackOLantern's and Spataro's code were deprecated for explicit warp level synchronization forms. Using an all-warp mask of 0xFFFFFFFF for these got the aforementioned code running, and verified correct (on my end). Thanks guys :) — Tyson Hilmer, Sep 21 '19 at 10:56
Steps #1 and #3 both evaluate the predicate (redundantly). If your predicate is expensive, consider modifying #1 to write the result and #3 to read it. — Tyson Hilmer, Sep 21 '19 at 10:58
