
I've implemented a for loop consisting of several Thrust transformations. My aim is to calculate r[i] for each value of i from 0 to N-1. Put simply, r is a column vector, and each of its elements can be calculated independently.

Therefore, I'm looking for a way to parallelize the for loop given below:

for(int i=0; i < N; i++) {
   thrust::device_vector<float> P(N, 0.0);  
   thrust::device_vector<int> corr_col_indices_d(col_indices.begin() + row_begin[i], col_indices.begin() + row_begin[i+1]); // indices of the columns
   thrust::device_vector<float> corr_values_d(values_d.begin() + row_begin[i], values_d.begin() + row_begin[i+1]); // values of the columns

    // P[j] = corr_values_d[k] if j is in corr_col_indices_d, else 0  (increment k if j is in corr_col_indices_d)
    thrust::scatter(corr_values_d.begin(), corr_values_d.end(), corr_col_indices_d.begin(), P.begin());
    r2[i] = thrust::inner_product(P.begin(), P.end(), r1.begin(), 0.0f);
}

1) After lots of googling, roaming around Stackoverflow and NVIDIA, I attempted to put all successive transformations into a bigger "transform" with a loop variable i.

auto counting_iter = thrust::make_counting_iterator(0);
thrust::transform(counting_iter, counting_iter + N, r2.begin(), [&](int i) {
    thrust::device_vector<float> P(N, 0.0);  
    thrust::device_vector<int> corr_col_indices_d(col_indices.begin() + row_begin[i], col_indices.begin() + row_begin[i+1]); // indices of the columns
    thrust::device_vector<float> corr_values_d(values_d.begin() + row_begin[i], values_d.begin() + row_begin[i+1]); 
    thrust::scatter(corr_values_d.begin(), corr_values_d.end(), corr_col_indices_d.begin(), P.begin());
    thrust::transform(P.begin(), P.end(), r1.begin(), P.begin(), thrust::multiplies<float>());
    return thrust::reduce(P.begin(), P.end());
});

Unfortunately, it doesn't work. Either there is no such thing as nesting transformations like this, or my syntax is wrong.

2) Then I tried to create a functor that takes all these device_vectors as input and operates on them. As stated here, it's not possible to pass device_vectors to a functor from outside, so I attempted to pass them as raw pointers instead.

struct loop {
    // constructor that takes a vector as a parameter
    __host__ __device__
    loop(int *t_row_begin, int *t_col_indices, float*t_values, float *r1): 
        t_row_begin_(t_row_begin), t_col_indices_(t_col_indices), t_values_(t_values), r1_(r1) {}

    // member variable to store the vector
    int *t_row_begin_;
    int *t_col_indices_;
    float *t_values_;
    float *r1_;

    __host__ __device__
    float operator()(int i) const {
        thrust::device_vector<float> P(N, 0.0);  
        thrust::device_vector<int> corr_col_indices_d(t_col_indices_ + t_row_begin_[i], t_col_indices_ + t_row_begin_[i + 1]); // indices of the columns
        thrust::device_vector<float> corr_values_d(t_values_ + t_row_begin_[i], t_values_ + t_row_begin_[i+1]); // values of the columns

        thrust::scatter(corr_values_d.begin(), corr_values_d.end(), corr_col_indices_d.begin(), P.begin());
        return thrust::inner_product(P.begin(), P.end(), r1_, 0.0f);
    }
};

and the loop itself:


loop lp(thrust::raw_pointer_cast(row_begin_d.data()), 
            thrust::raw_pointer_cast(col_indices_d.data()), 
            thrust::raw_pointer_cast(values_d.data()), 
            thrust::raw_pointer_cast(r1.data()));
auto iter = thrust::make_counting_iterator(0);
// perform the operations for each iteration of the loop using transform
thrust::transform(iter, iter + N, r2.begin(), lp);

3) I even tried passing the arguments to operator() rather than to the constructor of the functor:

struct loop {
    __host__ __device__
    float operator()(int i, thrust::device_vector<int>& col_indices, thrust::device_vector<float>& values_d, thrust::device_vector<int>& row_begin, thrust::device_vector<float>& r1) const {
        thrust::device_vector<float> P(N, 0.0);  
        thrust::device_vector<int> corr_col_indices_d(col_indices.begin() + row_begin[i], col_indices.begin() + row_begin[i+1]); // indices of the columns
        thrust::device_vector<float> corr_values_d(values_d.begin() + row_begin[i], values_d.begin() + row_begin[i+1]); // values of the columns
        thrust::scatter(corr_values_d.begin(), corr_values_d.end(), corr_col_indices_d.begin(), P.begin());
        return thrust::inner_product(P.begin(), P.end(), r1.begin(), 0.0f);
    }
};
auto iter = thrust::make_counting_iterator(0);
thrust::transform(iter, iter + N, r2.begin(), 
   thrust::make_transform_iterator(iter, loop()), 
   thrust::make_zip_iterator(thrust::make_tuple(col_indices, values_d, row_begin, r1)));

None of them compiles, and all those complicated error messages don't really help. So I'm looking for some assistance at this point.

CUDA version: 11.2
Thrust version: 1.10.0

Edit: In case you wonder, those vectors correspond to components of CSR matrix representation:

vector<int> row_begin;
vector<float> values;
vector<int> col_indices;

Updates

  • Fused transform and reduce to inner_product. as suggested by @paleonix.
Muhteva
  • You can't use `device_vector` in device code **at all**. Apart from that, nesting parallel algorithms like this is deprecated in newer versions of Thrust (not the one you are using) due to the new CUDA Dynamic Parallelism API in CUDA 12 (and the inefficiency of the old API). – paleonix Jan 05 '23 at 12:59
    A first step would be getting allocations out of the loop (reusing the vectors) and fusing `transform` and `reduce` into one `inner_product`. – paleonix Jan 05 '23 at 13:02
  • As you only read the scattered data once, you can also do it implicitly using a `permutation_iterator`. – paleonix Jan 05 '23 at 13:07
  • The whole idea of scattering the values from a small row (sparse matrix in CSR format) into a big vector seems very wasteful to me. Instead I would use a permutation iterator to get only the values from `r1` that you need. – paleonix Jan 05 '23 at 13:17
  • At that point the parallelism in these algorithms is probably very small, so you could use the `seq` execution policy and then use them inside `transform` over the rows as you wanted to do from the start. – paleonix Jan 05 '23 at 13:18
  • Thank you for the suggestions. But getting the allocation of `P` out of the for loop causes the program to output wrong values (`P` represents the row vectors and its values must be refreshed each time the loop starts). Additionally, I can't figure out how to use `permutation_iterator`. It would be great if you could provide your suggestions as an answer with some code. – Muhteva Jan 05 '23 at 14:04

1 Answer


To show my thought process when refactoring your code, I provide you with multiple steps/intermediate solutions:

  1. Get rid of the allocations inside the loop: they are expensive, you do not need copies of the rows, and P can be reused:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scatter.h>
#include <thrust/inner_product.h>

void foo(int N,
         thrust::host_vector<int> const &row_begin,
         thrust::device_vector<int> const &col_indices,
         thrust::device_vector<float> const &values_d,
         thrust::device_vector<float> const &r1,
         thrust::host_vector<float> &r2) {

    thrust::device_vector<float> P(N);
    for(int i = 0; i < N; ++i) {
        thrust::fill(P.begin(), P.end(), 0.0f);

        // P[j] = corr_values_d[k] if j is in corr_col_indices_d, else 0  (increment k if j is in corr_col_indices_d)
        thrust::scatter(values_d.cbegin() + row_begin[i],
                        values_d.cbegin() + row_begin[i+1],
                        col_indices.cbegin() + row_begin[i],
                        P.begin());

        r2[i] = thrust::inner_product(P.cbegin(), P.cend(),
                                      r1.cbegin(),
                                      0.0f);
    }
}
  2. Use a permutation iterator on r1 instead of scattering the values into P. This is much more efficient, as it avoids unnecessary memory accesses. The scattered (and therefore non-coalesced) access is expensive, so if you accessed the data more than once in this fashion, using thrust::scatter could be better again.
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/inner_product.h>
#include <thrust/iterator/permutation_iterator.h>

void foo(int N,
         thrust::host_vector<int> const &row_begin,
         thrust::device_vector<int> const &col_indices,
         thrust::device_vector<float> const &values_d,
         thrust::device_vector<float> const &r1,
         thrust::host_vector<float> &r2) {

    auto const r1_iter =
        thrust::make_permutation_iterator(
            r1.cbegin(),
            col_indices.cbegin());

    for(int i = 0; i < N; ++i) {
        r2[i] =
            thrust::inner_product(
                values_d.cbegin() + row_begin[i],
                values_d.cbegin() + row_begin[i+1],
                r1_iter + row_begin[i],
                0.0f);
    }
}
  3. There is not much parallelism left in the inner_product, so do it sequentially and parallelize the outer loop:
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/transform.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>

void foo(int N,
         thrust::device_vector<int> const &row_begin,
         thrust::device_vector<int> const &col_indices,
         thrust::device_vector<float> const &values_d,
         thrust::device_vector<float> const &r1,
         thrust::device_vector<float> &r2) {

    auto const row_begin_ptr = row_begin.data();
    auto const col_indices_ptr = col_indices.data();
    auto const values_d_ptr = values_d.data();
    auto const r1_iter =
        thrust::make_permutation_iterator(
            r1.cbegin(),
            col_indices.cbegin());

    thrust::transform(
        thrust::make_counting_iterator(0),
        thrust::make_counting_iterator(0) + N,
        r2.begin(),
        [=] __host__ __device__ (int i){
            return thrust::inner_product(thrust::seq,
                                         values_d_ptr + row_begin_ptr[i],
                                         values_d_ptr + row_begin_ptr[i+1],
                                         r1_iter + row_begin_ptr[i],
                                         0.0f);
        });
}
  4. While the above solution should be sufficient for e.g. banded matrices, where the rows are each very small and regular, irregularities like single long rows will make it quite inefficient again due to work-imbalance and warp divergence. The alternative is a segmented/batched reduction, as implemented by thrust::reduce_by_key. To use reduce_by_key here, one would need to "decompress" the CSR matrix (transforming row offsets into keys), even though Thrust might go back to row offsets under the hood either way to use CUB in the back end. To avoid this inefficiency, I used CUB directly via cub::DeviceSegmentedReduce::Sum. To still fuse the transform/multiplication into the reduction, one can use a transform iterator. I also ditched the permutation iterator and implemented the gather directly in the transform iterator:
#include <cub/cub.cuh>

#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

void foo(int N,
         thrust::device_vector<int> const &row_begin,
         thrust::device_vector<int> const &col_indices,
         thrust::device_vector<float> const &values_d,
         thrust::device_vector<float> const &r1,
         thrust::device_vector<float> &r2) {

    auto const col_indices_ptr = col_indices.data();
    auto const values_d_ptr = values_d.data();
    auto const r1_ptr = r1.data();

    auto const corr_iter =
        thrust::make_transform_iterator(
            thrust::make_counting_iterator(0),
            [=] __host__ __device__ (int j){
                return values_d_ptr[j] * r1_ptr[col_indices_ptr[j]];
            });
    
    // Determine temporary storage
    size_t temp_storage_bytes = 0;
    cub::DeviceSegmentedReduce::Sum(nullptr, temp_storage_bytes,
                                    corr_iter,
                                    r2.begin(),
                                    N,
                                    row_begin.cbegin(), row_begin.cbegin() + 1);
    // Allocate temporary storage
    thrust::device_vector<char> d_temp_storage(temp_storage_bytes);
    // Run sum-reduction
    cub::DeviceSegmentedReduce::Sum(thrust::raw_pointer_cast(d_temp_storage.data()),
                                    temp_storage_bytes,
                                    corr_iter,
                                    r2.begin(),
                                    N,
                                    row_begin.cbegin(), row_begin.cbegin() + 1);
}

Avoiding Temporary Buffer Initialization (and Allocation)

The only thing missing from this last solution for "ideal" performance is that the temporary storage is unnecessarily initialized. This can be avoided with a custom allocator, as shown in the Thrust example uninitialized_vector.cu. I did not include it in the above code to avoid the bloat.

An even nicer solution is rmm::device_buffer from the RAPIDS Memory Manager, but it is not included in the CUDA Toolkit.

In the future, libcu++ will hopefully give us a similarly nice C++ option, as they are working on memory resources at the moment.

If this operation is done repeatedly, one can also just reuse the temporary memory. Even when not using CUB directly, one can achieve this with a pool memory resource; see cuda/custom_temporary_allocation.cu and mr_basic.cu.

paleonix
  • @Muhteva I just added another, even better solution using CUB. I was using CUDA 11.2 on Compiler Explorer to match your (compilation) result. You need to specify `-extended-lambda` to `nvcc` to use device lambdas. – paleonix Jan 05 '23 at 15:24
  • @Muhteva You can still write a similar solution using `thrust::reduce_by_key` (I won't do it here). You can scatter the row indices (counting iterator) into the keys vector and then use `thrust::inclusive_scan` with the `thrust::maximum` functor to populate all the keys. – paleonix Jan 05 '23 at 15:43
  • @Muhteva Not sure what would be wrong with the third solution. By "it doesn't complete the last transform operation", do you mean that only `r2[N - 1]` is wrong? On [Compiler Explorer](https://cuda.godbolt.org/z/7YPPsWPvn) it compiles without any warnings (concerning the lambda or otherwise). – paleonix Jan 05 '23 at 16:01
  • @Muhteva I understand avoiding cusp as it doesn't seem to be maintained anymore. But if you use Thrust for GPU computing, CUB is always available and it is maintained by the same people as Thrust. I don't see a good reason not to use it here. – paleonix Jan 05 '23 at 20:22