If your output is relatively dense (i.e. a lot of indices and relatively few zeros), then the stream compaction approach suggested in the comments is a good solution. There are plenty of ready-to-go stream compaction implementations which you can probably adapt to your purposes.
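For reference, the dense case really is only a few lines with Thrust. A minimal sketch (the divisible-by-10000 predicate and the 1..N input mirror the complete example further down and are purely illustrative); note that the output buffer has to be sized for the worst case, which is exactly the memory cost mentioned below:

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>

// Illustrative predicate: an element is "interesting" if it is divisible by 10000
struct is_match
{
    __host__ __device__
    bool operator()(int x) const { return (x % 10000) == 0; }
};

int main()
{
    const int N = 1 << 20;
    thrust::device_vector<int> inputs(N);
    thrust::sequence(inputs.begin(), inputs.end(), 1);          // inputs = 1, 2, 3, ...

    // Compact the positions 0..N-1 whose corresponding input passes the predicate.
    // The counting iterator supplies the candidate indices and the input acts as
    // the stencil, so no separate flag array is needed.
    thrust::device_vector<int> out(N);                          // worst case: everything matches
    auto end = thrust::copy_if(thrust::counting_iterator<int>(0),
                               thrust::counting_iterator<int>(N),
                               inputs.begin(),                  // stencil
                               out.begin(),
                               is_match());

    std::cout << "found " << (end - out.begin()) << " indices" << std::endl;
    return 0;
}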
If your output is sparse, so that you need to save relatively few indices for a lot of inputs, then stream compaction isn't such a great solution because it wastes a lot of GPU memory. In that case (assuming you can roughly estimate an upper bound on the number of output indices), something like this:
template <typename T>
struct Array
{
    T* p;       // storage for the appended values
    int Nmax;   // capacity of p
    int* next;  // atomically incremented append position

    Array() = default;

    __host__ __device__
    Array(T* _p, int _Nmax, int* _next) : p(_p), Nmax(_Nmax), next(_next) {}

    // Returns the position the value was stored at, or -1 if the array is full
    __device__
    int append(T& val)
    {
        int pos = atomicAdd(next, 1);
        if (pos >= Nmax) {
            // Full: clamp the counter so it can't run away, and signal failure
            atomicExch(next, Nmax);
            return -1;
        } else {
            p[pos] = val;
            return pos;
        }
    }
};
is probably more appropriate. The idea is to use an atomically incremented position within the output array to keep track of where each thread should store its index. The append call signals (by returning -1) when the index array is full, and the counter tells you how many indices were stored, so you can work out a restart strategy: stop the current kernel and relaunch, resuming from the last part of the input you were able to cover.
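For illustration, one possible restart strategy is to scan the input in fixed-size chunks and redo a chunk with a smaller length whenever the index array fills. The sketch below builds on the Array struct above; kernelfind_range, find_all, the chunk size, the launch configuration and the divisible-by-10000 test (which mirrors the complete example further down) are all illustrative assumptions, not part of the original code:

#include <vector>
#include <algorithm>

// Hypothetical variant of the search kernel that scans only input[start, start+len)
// and appends global indices.
__global__
void kernelfind_range(int* input, int start, int len, Array<int> indices)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    for (; i < len; i += gridDim.x * blockDim.x) {
        int idx = start + i;
        if (input[idx] % 10000 == 0) {
            if (indices.append(idx) < 0) return;    // index array full: give up on this launch
        }
    }
}

// Host-side restart loop: if a chunk overflows the index array, retry it with a
// smaller length (a real implementation might instead grow the index array).
// pos must be host-accessible, e.g. allocated with cudaMallocManaged.
std::vector<int> find_all(int* d_input, int N, int* d_indices, int Nindices, int* pos)
{
    std::vector<int> result;
    int chunk = 1 << 18;                            // elements scanned per launch (assumed)
    for (int start = 0; start < N; ) {
        int len = std::min(chunk, N - start);
        *pos = 0;                                   // reset the append counter
        Array<int> index(d_indices, Nindices, pos);
        kernelfind_range<<<256, 256>>>(d_input, start, len, index);  // arbitrary launch config
        cudaDeviceSynchronize();
        if (*pos >= Nindices) {                     // the index array filled inside this chunk
            chunk = std::max(1, chunk / 2);         // retry the same chunk with a smaller range
            continue;
        }
        if (*pos > 0) {
            std::vector<int> h(*pos);
            cudaMemcpy(h.data(), d_indices, *pos * sizeof(int), cudaMemcpyDeviceToHost);
            result.insert(result.end(), h.begin(), h.end());
        }
        start += len;
    }
    return result;
}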
A complete example:
$ cat append.cu
#include <iostream>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/copy.h>

namespace AppendArray
{
    template <typename T>
    struct Array
    {
        T* p;       // storage for the appended values
        int Nmax;   // capacity of p
        int* next;  // atomically incremented append position

        Array() = default;

        __host__ __device__
        Array(T* _p, int _Nmax, int* _next) : p(_p), Nmax(_Nmax), next(_next) {}

        // Returns the position the value was stored at, or -1 if the array is full
        __device__
        int append(T& val)
        {
            int pos = atomicAdd(next, 1);
            if (pos >= Nmax) {
                // Full: clamp the counter so it can't run away, and signal failure
                atomicExch(next, Nmax);
                return -1;
            } else {
                p[pos] = val;
                return pos;
            }
        }
    };
}

__global__
void kernelfind(int* input, int N, AppendArray::Array<int> indices)
{
    // Grid-stride loop over the input; append the index of every element
    // divisible by 10000, stopping if the index array fills up
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    for (; idx < N; idx += gridDim.x * blockDim.x) {
        if (input[idx] % 10000 == 0) {
            if (indices.append(idx) < 0) return;
        }
    }
}

int main()
{
    const int Ninputs = 1 << 20;
    thrust::device_vector<int> inputs(Ninputs);
    thrust::counting_iterator<int> vals(1);
    thrust::copy(vals, vals + Ninputs, inputs.begin());    // inputs = 1, 2, 3, ...
    int* d_input = thrust::raw_pointer_cast(inputs.data());

    // Storage for the output indices plus a managed counter for the append position
    int Nindices = Ninputs >> 12;
    thrust::device_vector<int> indices(Nindices);
    int* d_indices = thrust::raw_pointer_cast(indices.data());
    int* pos; cudaMallocManaged(&pos, sizeof(int)); *pos = 0;
    AppendArray::Array<int> index(d_indices, Nindices, pos);

    int gridsize, blocksize;
    cudaOccupancyMaxPotentialBlockSize(&gridsize, &blocksize, kernelfind, 0, 0);
    kernelfind<<<gridsize, blocksize>>>(d_input, Ninputs, index);
    cudaDeviceSynchronize();

    // pos now holds the number of indices which were stored
    for (int i = 0; i < *pos; ++i) {
        int idx = indices[i];
        std::cout << i << " " << idx << " " << inputs[idx] << std::endl;
    }
    return 0;
}
$ nvcc -std=c++11 -arch=sm_52 -o append append.cu
$ ./append
0 9999 10000
1 19999 20000
2 29999 30000
3 39999 40000
4 49999 50000
5 69999 70000
6 79999 80000
7 59999 60000
8 89999 90000
9 109999 110000
10 99999 100000
11 119999 120000
12 139999 140000
13 129999 130000
14 149999 150000
15 159999 160000
16 169999 170000
17 189999 190000
18 179999 180000
19 199999 200000
20 209999 210000
21 219999 220000
22 239999 240000
23 249999 250000
24 229999 230000
25 279999 280000
26 269999 270000
27 259999 260000
28 319999 320000
29 329999 330000
30 289999 290000
31 299999 300000
32 339999 340000
33 349999 350000
34 309999 310000
35 359999 360000
36 379999 380000
37 399999 400000
38 409999 410000
39 369999 370000
40 429999 430000
41 419999 420000
42 389999 390000
43 439999 440000
44 459999 460000
45 489999 490000
46 479999 480000
47 449999 450000
48 509999 510000
49 539999 540000
50 469999 470000
51 499999 500000
52 569999 570000
53 549999 550000
54 519999 520000
55 589999 590000
56 529999 530000
57 559999 560000
58 619999 620000
59 579999 580000
60 629999 630000
61 669999 670000
62 599999 600000
63 609999 610000
64 699999 700000
65 639999 640000
66 649999 650000
67 719999 720000
68 659999 660000
69 679999 680000
70 749999 750000
71 709999 710000
72 689999 690000
73 729999 730000
74 779999 780000
75 799999 800000
76 809999 810000
77 739999 740000
78 849999 850000
79 759999 760000
80 829999 830000
81 789999 790000
82 769999 770000
83 859999 860000
84 889999 890000
85 879999 880000
86 819999 820000
87 929999 930000
88 869999 870000
89 839999 840000
90 909999 910000
91 939999 940000
92 969999 970000
93 899999 900000
94 979999 980000
95 959999 960000
96 949999 950000
97 1019999 1020000
98 1009999 1010000
99 989999 990000
100 1029999 1030000
101 919999 920000
102 1039999 1040000
103 999999 1000000
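One side effect of the atomic append is that the indices come out in whatever order the threads happened to win the atomic increment, which you can see in the output above. If order matters, sorting the stored prefix afterwards is cheap; a minimal sketch, reusing indices and pos from the example above:

#include <thrust/sort.h>

// Sort only the *pos entries that were actually appended
thrust::sort(indices.begin(), indices.begin() + *pos);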