Get unique elements of multiple arrays in CUDA

Question

Here is the problem: There number of arrays, for example, 2000 arrays, but only 256 integers in each array. And the range of the integers is quite considerable, [0, 1000000] for instance.

I want to get the unique elements for each array, in other words, remove the duplicate elements. I have 2 solutions:

Use Thrust to get the unique element for every array, so I have to do 2000 times thrust::unique. But each array is pretty small, this way may not get a good performance.
Implement hash table in cuda kernel, use 2000 blocks, 256 thread in each block. And make use of the shared memory to implement to hash table, then every single block will produce a element-unique array.

The above two methods seem unprofessional, are there elegant ways to solve the problem by CUDA ?

You could use `thrust::unique` inside a functor passed to `thrust::for_each`. The general methodology is outlined [here](http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix/28254765#28254765). — Robert Crovella, Oct 22 '15 at 14:15
is the range of the 256 integers limited somehow (e.g. `min=0, max=2^16`)? — m.s., Oct 22 '15 at 14:20
@m.s. No, the range of integers is real big. min =0 max=1000,000 (or larger). So it is impossible to use the integers as a array's base. — Pan.da, Oct 23 '15 at 00:58

score 4 · Accepted Answer · edited May 23 '17 at 10:30

You can use thrust::unique if you modify your data similar like it is done in this SO question: Segmented Sort with CUDPP/Thrust

For simplification, let's assume each array contains per_array elements and there is a total of array_num arrays. Each element is in the range [0,max_element].

Demo data with per_array=4, array_num=3 and max_element=2 could look like this:

data = {1,0,1,2},{2,2,0,0},{0,0,0,0}

To denote the membership of each element to the respective array we use the following flags:

flags = {0,0,0,0},{1 1 1 1},{2,2,2,2}

In order to get unique elements per array of the segmented dataset we need to do the following steps:

Transform data so the elements of each array i are within the unique range [i*2*max_element,i*2*max_element+max_element]
```
data = data + flags*2*max_element
data = {1,0,1,2},{6,6,4,4},{8,8,8,8}
```
Sort the transformed data:
```
data = {0,0,1,2},{4,4,6,6},{8,8,8,8}
```
Apply thrust::unique_by_key using data as keys and flags as values:
```
data  = {0,1,2}{4,6}{8}
flags = {0,0,0}{1,1}{2}
```

Transform data back to the original values:

data  = data - flags*2*max_element
data  = {0,1,2}{0,2}{0}

The maximum value of max_element is bounded by the size of the integer used for representing data. If it is an unsigned integer with n bits:

max_max_element(n,array_num) = 2^n/(2*(array_num-1)+1)

Given your array_num=2000, you will get the following limits for 32bit and 64bit unsigned integers:

max_max_element(32,2000) = 1074010
max_max_element(64,2000) = 4612839228234447

The following code implements the above steps:

unique_per_array.cu

#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/copy.h>

#include <iostream>
#include <cstdint>

#define PRINTER(name) print(#name, (name))
template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
    std::cout << name << ":\t";
    thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
    std::cout << std::endl;
}

int main()
{ 
    typedef uint32_t Integer;

    const std::size_t per_array = 4;
    const std::size_t array_num = 3;

    const std::size_t total_count = array_num * per_array;

    Integer demo_data[] = {1,0,1,2,2,2,0,0,0,0,0,0};

    thrust::device_vector<Integer> data(demo_data, demo_data+total_count);    

    PRINTER(data);

    // if max_element is known for your problem,
    // you don't need the following operation 
    Integer max_element = *(thrust::max_element(data.begin(), data.end()));
    std::cout << "max_element=" << max_element << std::endl;

    using namespace thrust::placeholders;

    // create the flags

    // could be a smaller integer type as well
    thrust::device_vector<uint32_t> flags(total_count);

    thrust::counting_iterator<uint32_t> flags_cit(0);

    thrust::transform(flags_cit,
                      flags_cit + total_count,
                      flags.begin(),
                      _1 / per_array);
    PRINTER(flags);


    // 1. transform data into unique ranges  
    thrust::transform(data.begin(),
                      data.end(),
                      thrust::counting_iterator<Integer>(0),
                      data.begin(),
                      _1 + (_2/per_array)*2*max_element);
    PRINTER(data);

    // 2. sort the transformed data
    thrust::sort(data.begin(), data.end());
    PRINTER(data);

    // 3. eliminate duplicates per array
    auto new_end = thrust::unique_by_key(data.begin(),
                                         data.end(),
                                         flags.begin());

    uint32_t new_size = new_end.first - data.begin();
    data.resize(new_size);
    flags.resize(new_size);

    PRINTER(data);
    PRINTER(flags);

    // 4. transform data back
    thrust::transform(data.begin(),
                      data.end(),
                      flags.begin(),
                      data.begin(),
                      _1 - _2*2*max_element);

    PRINTER(data);

}

Compiling and running yields:

$ nvcc -std=c++11 unique_per_array.cu -o unique_per_array && ./unique_per_array

data:   1   0   1   2   2   2   0   0   0   0   0   0   
max_element=2
flags:  0   0   0   0   1   1   1   1   2   2   2   2   
data:   1   0   1   2   6   6   4   4   8   8   8   8   
data:   0   1   1   2   4   4   6   6   8   8   8   8   
data:   0   1   2   4   6   8   
flags:  0   0   0   1   1   2   
data:   0   1   2   0   2   0

One more thing:

In the thrust development version there is an improvement implemented for thrust::unique* which improves performance by around 25 %. You might want to try this version if you aim for better performance.

score 0 · Answer 2 · answered Mar 10 '21 at 08:21

0

I think that thrust::unique_copy() can help you do this.

answered Mar 10 '21 at 08:21

黄小明

1

Get unique elements of multiple arrays in CUDA

2 Answers2