Questions tagged [cub]

CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model.

CUB (CUDA UnBound) is a C++ template library of components for use on NVIDIA GPUs running CUDA.

CUB includes common data parallel operations such as prefix scan, reduction, histogram and sort. CUB's collective primitives are not bound to any particular width of parallelism or to any particular data type and can be used at device, block, warp or thread scope.

It is used in the backend of other NVIDIA libraries, most prominently Thrust and RAPIDS.

CUB is developed by NVIDIA Research, and its website and documentation are hosted at https://nvlabs.github.io/cub, with the most recent source code available on GitHub. It has also been distributed with the CUDA Toolkit since at least CUDA 11.1.1 (the first version in which the CUB documentation is linked from the CUDA Toolkit documentation).

48 questions
17
votes
3 answers

Block reduction in CUDA

I am trying to do reduction in CUDA and I am really a newbie. I am currently studying a sample code from NVIDIA. I guess I am really not sure how to set up the block size and grid size, especially when my input array is larger (512 X 512) than a…
Ono
  • 1,357
  • 3
  • 16
  • 38
6
votes
4 answers

Sorting (small) arrays by key in CUDA

I'm trying to write a function that takes a block of unsorted key/value pairs such as <7, 4> <2, 8> <3, 1> <2, 2> <1, 5> <7, 1> <3, 8> <7, 2> and sorts them by key while reducing the values of pairs with the same key: <1, 5> <2, 10> <3, 9> <7,…
user1743798
  • 445
  • 2
  • 7
  • 20
4
votes
3 answers

Cost functional calculation for global optimization in CUDA

I am trying to optimize a function (say find the minimum) with n parameters (Xn). All Xi's are bound to a certain range (for example -200 to 200) and if any parameter leaves this range, the function goes to infinity very fast. However, n can be…
3
votes
1 answer

What is the usual way to use a modified C++ header-only library in my own open source project?

I want to use a modified C++ header library in my own open source project, but I am not sure what the usual way to do it is. For example, to use the original header library "CUB" in my project, I only need to: download CUB; include the "umbrella" header…
Jason7525
  • 31
  • 1
3
votes
3 answers

Sorting many small arrays in CUDA

I am implementing a median filter in CUDA. For a particular pixel, I extract its neighbors corresponding to a window around the pixel, say an N x N (3 x 3) window, and now have an array of N x N elements. I do not envision using a window of more than…
Eagle
  • 1,187
  • 5
  • 22
  • 40
3
votes
2 answers

CUDA reduction of many small, unequally sized arrays

I am wondering if anyone could suggest the best approach to computing the mean / standard deviation of a large number of relatively small but differently sized arrays in CUDA? The parallel reduction example in the SDK works on a single very large…
zenna
  • 9,006
  • 12
  • 73
  • 101
2
votes
1 answer

Why does this CUDA reduction fail if I use 31 blocks?

The following CUDA code takes a list of labels (0, 1, 2, 3, ...) and finds the sums of the weights of these labels. To accelerate the calculation, I use shared memory so that each thread maintains its own running sum. At the end of the calculation,…
Richard
  • 56,349
  • 34
  • 180
  • 251
2
votes
0 answers

in-place reduce sum for CUDA (CUB/Thrust)?

I have a device vector that needs to be transformed in multiple ways (e.g. creating 20 new arrays from it) and then reduce all (sum/accumulate), returning those sums in a host vector. The code is working with thrust::transform_reduce but looking at…
2
votes
1 answer

Installing CUB in NVIDIA Nsight

I want to use CUB with NVIDIA Nsight. I looked for tutorials on the internet for doing that, but I didn't find anything, even on the official pages of CUB. What do I need to do in order to use CUB in code I write using NVIDIA Nsight?
2
votes
1 answer

CUB template similar to thrust

The following is Thrust code: h_in_value[7] = thrust::reduce(thrust::device, d_in1 + a - b, d_ori_rho_L1 + a); Here, thrust::reduce takes the first and last input iterators, and Thrust returns the value to the CPU (copied to h_in_value). Can…
Ameya Wadekar
  • 49
  • 1
  • 3
2
votes
1 answer

Sum reduction with CUB

According to this article, sum reduction with the CUB library should be one of the fastest ways to perform a parallel reduction. As you can see in the code fragment below, the execution time is measured excluding the first cub::DeviceReduce::Reduce(temp_storage,…
physicist
  • 101
  • 3
  • 9
2
votes
1 answer

CUDA Thrust sort or CUB::DeviceRadixSort

I have a pool of particles represented by an array of float4 where the w component is the particle's current lifetime in the range [0, 1]. I need to sort this array based on the lifetime of the particles in descending order so that I can keep an…
Kinru
  • 389
  • 1
  • 6
  • 22
2
votes
2 answers

Residual calculation using CUDA

I have two vectors (oldvector and newvector). I need to calculate the value of the residual which is defined by the following pseudocode: residual = 0; forall i : residual += (oldvector[i] - newvector[i])^2 Currently, I am calculating this with two…
aatish
  • 272
  • 2
  • 13
2
votes
1 answer

CUDA cub::DeviceScan and the temp_storage_bytes parameter

I'm using cub::DeviceScan functions, and the sample code snippet has a parameter temp_storage_bytes, which it uses to allocate memory (which, incidentally, the code snippet never frees). The code snippet calls cub::DeviceScan functions with a pointer…
user2462730
  • 171
  • 1
  • 10
2
votes
2 answers

Using CUB::DeviceScan

I'm trying to do an exclusive sum reduction in CUDA. I am using the CUB library and have decided to try the CUB::DeviceReduce. However, my result is NaN, and I can't figure out why. Code is: #include #include #include…
user2462730
  • 171
  • 1
  • 10