
The task is like the question "How to set bits of a bit vector efficiently in parallel?", but for CUDA.

Consider a bit vector of N bits (N is large, e.g. 4G) and an array of M numbers (M is also large, e.g. 1G), each in the range 0..N-1, indicating which bit of the vector must be set to 1. The bit vector is stored as an array of integers, specifically uint32_t, so index 70, for example, lands in word 70 >> 5 = 2 at bit position 70 & 31 = 6.

I've tried a naive implementation with atomicOr() on the global memory:

__global__ void BitwiseSet(const uint32_t n_indices, const uint32_t *indices,
      const uint32_t n_bits, uint32_t *bitset)
{
  const uint32_t n_threads = blockDim.x * gridDim.x;
  const uint32_t i_thread = threadIdx.x + blockDim.x * blockIdx.x;
  // Grid-stride loop: each thread handles every n_threads-th index
  for(uint32_t i=i_thread; i<n_indices; i+=n_threads) {
    const uint32_t index = indices[i];
    assert(index < n_bits);
    const uint32_t i_word = index >> 5;  // which 32-bit word of the bitset
    const uint32_t i_bit = index & 31;   // which bit within that word
    atomicOr(bitset+i_word, 1u<<i_bit);
  }
}
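
For reference, a minimal host-side sketch for driving this kernel (the sizes and launch configuration below are placeholders, not the 4G/1G setup measured next; error checking is omitted, and the kernel above is assumed to be in the same .cu file):

#include <cassert>
#include <cstdint>
#include <cuda_runtime.h>

int main() {
  const uint32_t n_bits = 1u << 26;     // placeholder, much smaller than 4G
  const uint32_t n_indices = 1u << 24;  // placeholder, much smaller than 1G
  const uint32_t n_words = (n_bits + 31) / 32;

  uint32_t *d_indices = nullptr, *d_bitset = nullptr;
  cudaMalloc(&d_indices, n_indices * sizeof(uint32_t));
  cudaMalloc(&d_bitset, n_words * sizeof(uint32_t));
  // ... fill d_indices with values in 0..n_bits-1 ...

  // The bitset must start zeroed: the kernel only ORs bits in
  cudaMemset(d_bitset, 0, n_words * sizeof(uint32_t));
  BitwiseSet<<<256, 256>>>(n_indices, d_indices, n_bits, d_bitset);
  cudaDeviceSynchronize();

  cudaFree(d_indices);
  cudaFree(d_bitset);
  return 0;
}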

And it produces interesting results for 4G bits and 1G indices:

  • RTX3090: 0.0383266 sec. for sorted indices vs. 0.332674 sec. for unsorted (sorted is 8.68x faster)
  • RTX2080: 0.0564464 sec. for sorted indices vs. 1.23666 sec. for unsorted (sorted is 21.91x faster)

So it seems the devices coalesce/merge multiple atomicOr() operations within a warp, and perhaps the L1 cache kicks in, so that when indices conflict (which is the case for sorted indices, where neighboring threads often target the same word), the 32-bit atomic updates are in reality much faster than for non-conflicting indices (the unsorted case).

Can we further improve the sorted or unsorted case?

UPDATE: answering the comments: any solution is of interest, whether for the sorted or the unsorted case, with or without repetitions. Unsorted with repetitions is the most generic case, so it would be of the greatest interest.

UPDATE2: following the suggestion to vectorize memory accesses, I implemented this:

__global__ void BitwiseSet(const uint32_t n_indices, const uint32_t *indices,
      const uint32_t n_bits, uint32_t *bitset)
{
  const uint32_t n_threads = blockDim.x * gridDim.x;
  const uint32_t i_thread = threadIdx.x + blockDim.x * blockIdx.x;
  const uint32_t n_vectors = n_indices / 4;
  for(uint32_t i=i_thread; i<n_vectors; i+=n_threads) {
    // One 128-bit load fetches 4 indices at once
    const uint4 v_index = reinterpret_cast<const uint4*>(indices)[i];
    assert(v_index.x < n_bits);
    assert(v_index.y < n_bits);
    assert(v_index.z < n_bits);
    assert(v_index.w < n_bits);
    uint4 vi_word, vi_bit;
    vi_word.x = v_index.x >> 5;
    vi_word.y = v_index.y >> 5;
    vi_word.z = v_index.z >> 5;
    vi_word.w = v_index.w >> 5;
    vi_bit.x = v_index.x & 31;
    vi_bit.y = v_index.y & 31;
    vi_bit.z = v_index.z & 31;
    vi_bit.w = v_index.w & 31;
    atomicOr(bitset+vi_word.x, 1u<<vi_bit.x);
    atomicOr(bitset+vi_word.y, 1u<<vi_bit.y);
    atomicOr(bitset+vi_word.z, 1u<<vi_bit.z);
    atomicOr(bitset+vi_word.w, 1u<<vi_bit.w);
  }
  // The first (up to 3) global threads handle the tail when n_indices
  // is not a multiple of 4
  if(i_thread < 4) {
    const uint32_t tail_start = n_vectors*4;
    const uint32_t tail_len = n_indices - tail_start;
    if(i_thread < tail_len) {
      const uint32_t index = indices[tail_start+i_thread];
      assert(index < n_bits);
      const uint32_t i_word = index >> 5;
      const uint32_t i_bit = index & 31;
      atomicOr(bitset+i_word, 1u<<i_bit);
    }
  }
}
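
An assumption worth making explicit in the vectorized version: the reinterpret_cast to uint4 issues 128-bit loads, which require indices to be 16-byte aligned. Pointers returned by cudaMalloc are aligned to at least 256 bytes, so reading from the start of an allocation is fine, but if the kernel were ever handed an offset pointer, a guard like this (hypothetical, placed at the top of the kernel) would catch misalignment:

// Hypothetical guard: uint4 loads require 16-byte alignment
assert(reinterpret_cast<uintptr_t>(indices) % 16 == 0);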

But at least on RTX2080 it's slower (I don't have the eGPU with RTX3090 with me right now to test):

  • RTX2080: 0.0815998 sec. for sorted vs. 1.39829 sec. for unsorted (sorted is 17.14x faster)
  • If the sorted case is common, you could try a warp-level vote to see if coalescing is worth it, then use warp-level reduction to make a single (or a few) atomic operations only (see the sketch after this comment list). – Homer512 Jul 27 '22 at 07:21
  • @Homer512 Isn't that done automatically nowadays? See update to [this](https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/) blog post. – paleonix Jul 27 '22 at 08:00
  • @paleonix yes, the performance measurements suggest it's done automatically in the hardware. – Serge Rogatch Jul 27 '22 at 08:05
  • Yep, seems like it is. See also here: https://on-demand.gputechconf.com/gtc/2013/presentations/S3101-Atomic-Memory-Operations.pdf#page=28 – Homer512 Jul 27 '22 at 09:49
  • This problem is better described as "sparse to dense set representation". Except... you need to tell us whether the M values can have repetitions or not; and whether they're sorted or not. – einpoklum Jul 28 '22 at 17:23
  • @einpoklum, I've edited the question to address your comment. Basically, any solution would be of interest, but unsorted with repetitions case is more generic. – Serge Rogatch Jul 29 '22 at 03:44
  • Your peak throughput is 4.5GiB / .038s = 118 GiB/s, which says you are not bandwidth bound (~12% of peak on 3090). Possibly latency bound. Have you run it in nsight-compute to see what the reported bottleneck is? My hunch of the best next step would be to use vectorized loads for the indices, and do 4 `atomicOr` per thread-iteration rather than one. See https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/ – harrism Aug 09 '22 at 04:01
  • @harrism, thanks, I've added this experiment to the question. The program became slower... – Serge Rogatch Aug 09 '22 at 10:19
  • Suggest you profile it. – harrism Aug 10 '22 at 01:30
  • @SergeRogatch, maybe it makes sense to specify launch bounds for your kernel, especially the one with vectorized access. And try to play around with the max registers per thread (perhaps just having 16 regs per thread is more than enough here). Also, configure the cuda runtime to use 48k L1 cache and 16k registers or so per multiprocessor. – pem Nov 27 '22 at 13:57
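
Below is a minimal sketch of the warp-aggregation idea from Homer512's comment above. It assumes compute capability 7.0+ (Volta or newer, for __match_any_sync) and 1D thread blocks, and is an illustration of the technique rather than a tuned implementation: lanes whose indices fall into the same 32-bit word merge their bits via shuffles, and only the lowest lane of each group issues the atomicOr.

__global__ void BitwiseSetAggregated(const uint32_t n_indices, const uint32_t *indices,
      const uint32_t n_bits, uint32_t *bitset)
{
  const uint32_t n_threads = blockDim.x * gridDim.x;
  const uint32_t i_thread = threadIdx.x + blockDim.x * blockIdx.x;
  for(uint32_t i=i_thread; i<n_indices; i+=n_threads) {
    const uint32_t index = indices[i];
    assert(index < n_bits);
    const uint32_t i_word = index >> 5;
    const uint32_t my_bit = 1u << (index & 31);
    // Mask of active lanes whose index targets the same 32-bit word
    const uint32_t peers = __match_any_sync(__activemask(), i_word);
    // Every peer iterates over the group in the same order, so the
    // group stays converged; each member's bit is OR-ed in via shuffle
    uint32_t bits = 0;
    for(uint32_t rest = peers; rest != 0; rest &= rest - 1) {
      const int src = __ffs(rest) - 1;  // next peer lane, lowest first
      bits |= __shfl_sync(peers, my_bit, src);
    }
    // Only the lowest peer lane touches global memory for this group
    if((threadIdx.x & 31) == static_cast<uint32_t>(__ffs(peers) - 1))
      atomicOr(bitset + i_word, bits);
  }
}

On compute capability 8.0+, the shuffle loop could be replaced by a single __reduce_or_sync(peers, my_bit). Since the measurements above already suggest the hardware aggregates conflicting atomics on its own, this may or may not win; it is the kind of variant worth comparing under nsight-compute, as suggested in the comments.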

0 Answers