Given a std::vector<Eigen::Triplet<double>>, it is possible to create an Eigen::SparseMatrix<double> using the setFromTriplets function. By default, this function accumulates repeated triplets.
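
For example, a minimal sketch of the accumulation behavior I am relying on:

#include <Eigen/Sparse>
#include <vector>

int main() {
    std::vector<Eigen::Triplet<double>> triplets{
        {0, 0, 1.0}, {1, 2, 2.0}, {0, 0, 3.0}  // (0, 0) appears twice
    };
    Eigen::SparseMatrix<double> A(2, 3);
    A.setFromTriplets(triplets.begin(), triplets.end());
    // Duplicates are summed: A.coeff(0, 0) == 4.0
}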

I am trying to move my simulator code to CUDA, and I have to build my sparse matrix. I have a kernel which computes the triplets and returns them in three device pointers (int*, int*, double*). However, I am not sure how to assemble the matrix from these pointers. The cuSPARSE documentation says nothing about duplicated entries in its COO format, and it actually requires the row indices to be sorted.

Is there any way to create a COO matrix (or, even better, a CSR matrix directly) with CUDA/cuSparse, given the i, j and value pointers, when some (i, j) entries are repeated?

jjcasmar

1 Answer

The code at the end of this answer does the needed operations on the GPU using the Thrust library, which is part of the CUDA Toolkit. If this operation is done often enough that you care about getting the best possible performance, you could directly use the CUB library, which Thrust uses as its backend. That way you should be able to avoid multiple unnecessary copy operations. The APIs of interest for this would be cub::DeviceMergeSort (one might also be able to use cub::DeviceRadixSort with a type-punning transform iterator) and cub::DeviceSegmentedReduce.
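
For illustration, here is a hedged sketch of the cub::DeviceRadixSort route. Instead of a type-punning iterator, it packs each (row, col) pair into one 64-bit key (assuming non-negative 32-bit indices), so radix-sorting the keys orders the triplets lexicographically. The helper names (pack_keys, sort_triplets_cub) are mine, not CUB's:

#include <cstddef>
#include <cstdint>
#include <cub/device/device_radix_sort.cuh>
#include <cuda_runtime.h>

// Pack (row, col) into a single 64-bit key so that sorting the keys
// sorts the triplets by row first, then by column.
__global__ void pack_keys(int const *rows, int const *cols,
                          std::uint64_t *keys, int n) {
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        keys[i] = (static_cast<std::uint64_t>(rows[i]) << 32) |
                  static_cast<std::uint32_t>(cols[i]);
    }
}

void sort_triplets_cub(int const *d_rows, int const *d_cols,
                       double const *d_vals,
                       std::uint64_t *d_keys_in, std::uint64_t *d_keys_out,
                       double *d_vals_out, int n) {
    pack_keys<<<(n + 255) / 256, 256>>>(d_rows, d_cols, d_keys_in, n);

    // CUB's usual two-phase pattern: the first call only queries the
    // temporary storage size, the second call does the actual sort.
    void *d_temp = nullptr;
    std::size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals, d_vals_out, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals, d_vals_out, n);
    cudaFree(d_temp);
    // The segmented reduction over equal keys would follow here; unpack
    // d_keys_out back into row/col arrays afterwards (not shown).
}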

One way of avoiding the unnecessary initialization of the temporary vectors can be found in the uninitialized_vector.cu Thrust example; a condensed version is sketched below.
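
Condensed from that example, the idea is an allocator whose construct() is a no-op:

#include <thrust/device_malloc_allocator.h>
#include <thrust/device_vector.h>

// Allocator whose construct() does nothing, so a device_vector using it
// skips the device-side zero-fill on construction.
template <typename T>
struct uninitialized_allocator : thrust::device_malloc_allocator<T> {
    __host__ __device__ void construct(T *) {
        // no-op: leave the freshly allocated memory uninitialized
    }
};

template <typename T>
using uninitialized_vector =
    thrust::device_vector<T, uninitialized_allocator<T>>;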

Another way of possibly improving performance without ditching Thrust is the thrust::cuda::par_nosync execution policy. It is not implemented in the version of Thrust (1.15) that ships with the CUDA Toolkit 11.6, so you will need a more recent version (1.16 or later) from GitHub. This can help hide kernel launch overheads, as it makes the Thrust algorithms asynchronous where possible (the segmented reduction has to be synchronous because it returns the new end iterators). You can also pass explicit CUDA streams and pool memory resources to the API to better control the asynchronous behavior and to avoid repeated allocation overhead, respectively. See the cuda/explicit_cuda_stream.cu, cuda/custom_temporary_allocation.cu and mr_basic.cu Thrust examples for more information. A minimal sketch of the execution-policy part follows.
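
Sketch only, assuming Thrust 1.16 or later (the wrapper function is just for illustration):

#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>

// Sort key-value pairs on an explicit stream without the blocking
// synchronization that the default thrust::cuda::par policy performs.
void sort_pairs_async(int *d_keys, double *d_vals, int n,
                      cudaStream_t stream) {
    auto const keys = thrust::device_pointer_cast(d_keys);
    auto const vals = thrust::device_pointer_cast(d_vals);
    thrust::sort_by_key(thrust::cuda::par_nosync.on(stream),
                        keys, keys + n, vals);
    // Synchronize only when the sorted data is actually needed:
    // cudaStreamSynchronize(stream);
}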

#include <thrust/copy.h>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>

// Transforms triplets into the COO sparse matrix format in-place and
// returns the number of non-zero entries (nnz).
auto triplets2coo(int *d_rows,
                  int *d_cols,
                  double *d_vals,
                  int num_triplets) {
    // thrust::device_ptr is a lightweight wrapper that lets Thrust algorithms
    // know which memory the pointer points to. thrust::device_pointer_cast is
    // the preferred way of creating them. While an API directly taking
    // wrapped pointers as arguments would be preferable, I chose it this way 
    // to show how to wrap the pointers.
    auto const coord_iter = thrust::make_zip_iterator(
        thrust::make_tuple(
            thrust::device_pointer_cast(d_rows),
            thrust::device_pointer_cast(d_cols)
        ));
    auto const vals = thrust::device_pointer_cast(d_vals);

    // Sort in-place.
    thrust::sort_by_key(coord_iter, coord_iter + num_triplets,
                        vals);
    // Segmented reduction can only be done out-of-place,
    // so allocate more memory.
    thrust::device_vector<int> tmp_rows(num_triplets);
    thrust::device_vector<int> tmp_cols(num_triplets);
    thrust::device_vector<double> tmp_vals(num_triplets);
    // Caution: These vectors are unnecessarily initialized to zero!

    auto const tmp_coord_iter = thrust::make_zip_iterator(
        thrust::make_tuple(
            tmp_rows.begin(),
            tmp_cols.begin()
        ));

    auto const new_ends =
        thrust::reduce_by_key(coord_iter, coord_iter + num_triplets,
                              vals,
                              tmp_coord_iter,
                              tmp_vals.begin());
    auto const nnz = thrust::distance(tmp_vals.begin(), new_ends.second);

    thrust::copy_n(tmp_coord_iter, nnz, coord_iter);
    thrust::copy_n(tmp_vals.cbegin(), nnz, vals);

    return nnz;
}
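
For completeness, one possible way to get from the returned COO to the CSR format asked about is cuSPARSE's cusparseXcoo2csr, which compresses sorted row indices into row offsets; the sorted, duplicate-free rows produced above are exactly what it expects. The wrapper below is only a sketch: handle creation and the allocation of d_csr_row_ptr (num_rows + 1 ints) are assumed to happen elsewhere.

#include <cusparse.h>

// Hypothetical usage: d_rows/d_cols/d_vals hold num_triplets entries
// produced by the assembly kernel.
void assemble_csr(cusparseHandle_t handle,
                  int *d_rows, int *d_cols, double *d_vals,
                  int num_triplets, int num_rows,
                  int *d_csr_row_ptr) {
    auto const nnz = triplets2coo(d_rows, d_cols, d_vals, num_triplets);
    // Compress the sorted, duplicate-free COO row indices into CSR
    // row offsets.
    cusparseXcoo2csr(handle, d_rows, static_cast<int>(nnz), num_rows,
                     d_csr_row_ptr, CUSPARSE_INDEX_BASE_ZERO);
}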

paleonix