Using CUB::DeviceScan

Question

I'm trying to do an exclusive sum reduction in CUDA. I am using the CUB library and have decided to try the CUB::DeviceReduce. However, my result is NaN, and I can't figure out why.

Code is:

#include <cub/cub.cuh>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
using std::cout;
using std::endl;
#define DSIZE 512

void dev_cumsum( const float *dev_inData, float *dev_outData ) {
    int n = 512;
    void* dev_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceScan::ExclusiveSum(dev_temp_storage,temp_storage_bytes,const_cast<float*>(dev_inData),dev_outData,n);
    cudaMalloc(&dev_temp_storage,temp_storage_bytes);
    cub::DeviceScan::ExclusiveSum(dev_temp_storage,temp_storage_bytes,const_cast<float*>(dev_inData),dev_outData,n);
}

int main(){
    float h_data[512];
    float* d_data;
    float* d_result;
    float h_result[512];
    cudaMalloc(&d_data, DSIZE*sizeof(float));
    cudaMalloc(&d_result, DSIZE*sizeof(float));
    h_data[0] = rand()%10;
    h_result[0] = 0;
    for (int i=1; i<DSIZE; i++) {
        h_data[i] = rand()%10;
        h_result[i] = h_data[i-1]+h_result[i-1];
    }
    cudaMemcpy(d_data, h_data, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
    dev_cumsum(d_data, d_result);
    printf("CPU result = %f\n", h_result[511]);
    cudaMemcpy(h_result, d_result, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
    printf("GPU result = %f\n", h_result[511]);
    for( int i = 0; i < DSIZE; i++ ) {cout << h_result[i] << " ";}
    cout << endl;
    return 0;
}

This code gives me NaN for the last 8 elements of the device result.

This code is running on a GTX650 Ti Boost in Linux Mint15. I'm using NSight and the console output compile command is:

Invoking: NVCC Compiler
/usr/local/cuda-5.5/bin/nvcc -G -g -O0 -gencode arch=compute_30,code=sm_30 -odir "" -M -o "main.d" "../main.cu"
/usr/local/cuda-5.5/bin/nvcc --device-c -G -O0 -g -gencode arch=compute_30,code=compute_30 -gencode arch=compute_30,code=sm_30  -x cu -o  "main.o" "../main.cu"

Cuda version is 5.5 CUB version 1.0.2

This was tested on another computer with Cuda 6, OSX10.9.2, CUB 1.2.3 and running a GT750M, and reproduced the error of last 8 numbers being NaN

edit: The code works correctly with int and double, but not float.

edit: With thanks to Robert Crovella, this question was originally asked in regards to DeviceReduce. That code worked, it was throwing NaN because earlier code using DeviceScan was feeding it NaN as input. Question is revised to suit

There's nothing wrong with the code you've shown. [Here's a fully-worked example](http://pastebin.com/m7LmQXKA) based on your first code snippet. The problem you're having is either in some aspect of the code you haven't shown, or else with your machine configuration (CUDA not installed/not working correctly, etc.) Post a complete code, try running your code with `cuda-memcheck`, and/or add [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) to your code. — Robert Crovella, Apr 29 '14 at 05:35
Thanks @Robert. It turns out the problem was NaN input from an earlier DeviceScan block of code. The question has been rewritten and has fully compilable code which reproduces the error (for me) — user2462730, Apr 29 '14 at 06:26

Robert Crovella · Answer 1 · 2014-05-16T21:31:16.217

EDIT: cub 1.3.0 was recently released, and I believe it includes a fix for this issue.

There's a few changes I would make to your code, that I consider to be errors, but I don't know if they are affecting what you are seeing. In the following code section, you are using h_result[0] without initializing it, so add the line I have marked with a comment:

h_data[0] = rand()%10;
h_result[0] = 0;    // ADD THIS LINE
for (int i=1; i<DSIZE; i++) {
    h_data[i] = rand()%10;
    h_result[i] = h_data[i-1]+h_result[i-1];
}

(Clearly that one should not be influencing your GPU result.) Also, your final cudaMemcpy operation is not quite right:

cudaMemcpy(&h_result, d_result, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
           ^
           delete this ampersand

Since h_result is already a pointer in your formulation, we don't need to pass the address of it to cudaMemcpy.

Can you try making those changes and see what kind of results you get?

I've been struggling a bit with this. If you can still reproduce the error, I'd appreciate it if you can:

reboot your machine and try again
respond back with the actual updated code you are running, the compile command you are using, the CUDA version, the CUB version, and the GPU you are running on, as well as the system OS. (edit your original question with this info)

Thanks @Robert, I've made those changes, added system info. I've also tested on another system running updated CUB and CUDA - same error — user2462730, Apr 29 '14 at 23:26
Yes, there seems to be an issue. I expect someone will have more to say shortly. — Robert Crovella, May 01 '14 at 15:23

score 1 · Answer 2 · answered Apr 30 '14 at 02:50

When I ran the code, I found that it is not the last 8 values that are set to NaN but it is, in fact, all of the values since the last integer-multiple of 72 that are set to NaN. In your example there are 512 values: this means that the first 504 (7 * 72) were correct and the following 8 values were NaN.

This behaviour seems to continues until 568 (8 * 72) values and thereafter it seems to work correctly.

The code that I used to test this is here: http://pastebin.com/kXVvuKAN

I compiled the code with the following command:

nvcc --relocatable-device-code=true -gencode arch=compute_30,code=compute_30 -G -o main main.cu

NOTE: If I didn't use the -G parameter, the results were more random. However, with the -G command, it gave the clear pattern mentioned above.

Using CUB::DeviceScan

2 Answers2