There seem to be several problems with the code you have presented.
As suggested here, you must use bare pointers carefully with a thrust algorithm if those bare pointers are device pointers. Your options are to either wrap those bare pointers in thrust::device_ptr and use those instead, or to pass the thrust::device execution policy. If you do neither, thrust will choose the host execution path for your algorithm, and if the pointers are actually device pointers, hilarity ensues.
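For illustration, here is a minimal sketch of the thrust::device_ptr option; the function and variable names (sum_on_device, d_data) are hypothetical, not from your code:

```cuda
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// Wrapping the raw device pointer in thrust::device_ptr lets thrust
// infer the device execution path from the iterator type, so no
// explicit execution policy is needed.
int sum_on_device(int *d_data, int n)
{
    thrust::device_ptr<int> p(d_data);
    return thrust::reduce(p, p + n);
}
```

This requires a GPU and nvcc to run, so treat it as a sketch of the dispatch mechanism rather than a drop-in snippet.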
As suggested in the other answer, if you wish to modify the incoming data at its source, you must take it by reference. The only place this really matters is in your transform op. Either way, the incoming data is convertible to your parameter types, which is why your code compiles.
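To make the by-reference distinction concrete, a sketch (assuming nvcc with --expt-extended-lambda; these lambdas are illustrative, not taken from your code):

```cuda
#include <cstdint>

void parameter_sketch()
{
    // By reference: the write reaches the source array.
    auto modifies_source = [] __device__ (int &x) -> int32_t { x = 3; return 2; };

    // By value: the write lands on a local copy and is discarded,
    // yet this still compiles, because the source element converts to int.
    auto modifies_copy   = [] __device__ (int x)  -> int32_t { x = 3; return 2; };

    (void)modifies_source; (void)modifies_copy; // silence unused warnings
}
```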
When using lambdas in device code, as you intend to do, it's necessary to decorate those lambdas correctly with __device__ (or __host__ __device__, but only __device__ is necessary for your usage here). If you decorate with __device__ only, no further changes are needed. If you decorate with __host__ __device__, you will discover that generic lambdas (those with (auto ...) parameters) cannot be so decorated. This last point applies to your first lambda (the transform op), but it is fixable for the case you have shown.
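A sketch of those decoration rules side by side (assuming nvcc with --expt-extended-lambda; the names here are illustrative):

```cuda
#include <cstdint>

void decoration_sketch()
{
    // __device__ only: a generic (auto) parameter is allowed.
    auto dev_only = [] __device__ (auto &x) -> int32_t { x = 3; return 2; };

    // __host__ __device__: the parameter type must be spelled out,
    // since a generic lambda cannot carry this combined decoration.
    auto both = [] __host__ __device__ (int &x) -> int32_t { x = 3; return 2; };

    (void)dev_only; (void)both; // silence unused warnings
}
```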
The following code has those 3 items fixed and appears to show correct modification of the source data:
$ cat t1835.cu
#include <thrust/scan.h>
#include <thrust/execution_policy.h>
#include <iostream>
int main(){
  const int N = 10;
  // Yes, I wish to use raw pointers rather than nice device vectors.
  int *data;
  if (cudaMalloc(&data, N * sizeof(int)) != cudaSuccess){
    std::cout << "Couldn't allocate!" << std::endl;
    return -1;
  }
  cudaMemset(data, 0, N * sizeof(int));
  const int answer = thrust::transform_reduce(thrust::device,
      data, data + N,
      [] __device__ (auto &x) -> int32_t {
        // I would like to change the values of data here
        x = 3;
        return 2;
      },
      int32_t{0},
      [] __device__ (const int32_t running_sum, const int32_t this_value) -> int32_t {
        return running_sum + this_value;
      });
  if (answer != 20){
    std::cout << "Wrong answer!" << std::endl;
    return -1;
  }
  int *h_data = new int[N];
  cudaMemcpy(h_data, data, N * sizeof(data[0]), cudaMemcpyDeviceToHost);
  cudaFree(data); // In actuality, I'd do something with transformed data here
  std::cout << h_data[0] << std::endl;
  delete [] h_data;
  return 0;
}
$ nvcc -lineinfo -arch=sm_70 -O3 -o t1835 t1835.cu -std=c++14 --expt-extended-lambda
$ ./t1835
3
$
(CUDA 11.2, Centos 7, V100 GPU)
I suspect the reason that your posted code appears to compile and run correctly is that:
- The host execution path was actually selected
- The construction of your transform op was such that the read of the incoming data could be optimized out by the compiler (or, perhaps, equivalently, there is no read of the incoming data), thus no illegal access occurs.