There seem to be several problems with the code you have presented.
As suggested here, you must use bare pointers carefully with a thrust algorithm if those bare pointers are device pointers. Your options are to either wrap those bare pointers in thrust::device_ptr and use those instead, or to pass the thrust::device execution policy. If you do neither, thrust will choose the host execution path for your algorithm, and if the pointers are actually device pointers, hilarity ensues.
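For illustration, here is a minimal sketch of the thrust::device_ptr option; the function and variable names (sum_on_device, d_data) are hypothetical, not from your code:

```cuda
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// Wrapping the raw device pointer in thrust::device_ptr lets thrust
// infer the device execution path from the iterator type, so no
// explicit execution policy is needed.
int sum_on_device(int *d_data, int n)
{
    thrust::device_ptr<int> p(d_data);
    return thrust::reduce(p, p + n);
}
```

This requires a GPU and nvcc to run, so treat it as a sketch of the dispatch mechanism rather than a drop-in snippet.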
As suggested in the other answer, if you wish to modify the incoming data at its source, you must take it by reference. The only place this really matters is in your transform op. Either way, the incoming data is convertible to your parameter types, which is why your code compiles.
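To make the by-reference distinction concrete, a sketch (assuming nvcc with --expt-extended-lambda; these lambdas are illustrative, not taken from your code):

```cuda
#include <cstdint>

void parameter_sketch()
{
    // By reference: the write reaches the source array.
    auto modifies_source = [] __device__ (int &x) -> int32_t { x = 3; return 2; };

    // By value: the write lands on a local copy and is discarded,
    // yet this still compiles, because the source element converts to int.
    auto modifies_copy   = [] __device__ (int x)  -> int32_t { x = 3; return 2; };

    (void)modifies_source; (void)modifies_copy; // silence unused warnings
}
```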
When using lambdas in device code, as you intend to do, it's necessary to decorate those lambdas correctly with __device__ (or __host__ __device__, but only __device__ is necessary for your usage here). If you decorate with __device__ only, no further changes are needed. If you decorate with __host__ __device__, you will discover that generic lambdas (those with (auto ...) parameters) cannot be so decorated. This last point applies to your first lambda (the transform op), but it is fixable for the case you have shown.
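A sketch of those decoration rules side by side (assuming nvcc with --expt-extended-lambda; the names here are illustrative):

```cuda
#include <cstdint>

void decoration_sketch()
{
    // __device__ only: a generic (auto) parameter is allowed.
    auto dev_only = [] __device__ (auto &x) -> int32_t { x = 3; return 2; };

    // __host__ __device__: the parameter type must be spelled out,
    // since a generic lambda cannot carry this combined decoration.
    auto both = [] __host__ __device__ (int &x) -> int32_t { x = 3; return 2; };

    (void)dev_only; (void)both; // silence unused warnings
}
```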
The following code has those 3 items fixed and appears to show correct modification of the source data:
$ cat t1835.cu
#include <thrust/scan.h>
#include <thrust/execution_policy.h>
#include <iostream>
int main(){
  const int N = 10;
  // Yes, I wish to use raw pointers rather than nice device vectors.
  int *data;
  if (cudaMalloc(&data, N * sizeof(int)) != cudaSuccess){
    std::cout << "Couldn't allocate!" << std::endl;
    return -1;
  }
  cudaMemset(data, 0, N * sizeof(int));
  const int answer = thrust::transform_reduce(thrust::device,
      data, data + N,
      [] __device__ (auto &x) -> int32_t {
        // I would like to change the values of data here
        x = 3;
        return 2;
      },
      int32_t{0},
      [] __device__ (const int32_t running_sum, const int32_t this_value) -> int32_t {
        return running_sum + this_value;
      });
  if (answer != 20){
    std::cout << "Wrong answer!" << std::endl;
    return -1;
  }
  int *h_data = new int[N];
  cudaMemcpy(h_data, data, N * sizeof(data[0]), cudaMemcpyDeviceToHost);
  cudaFree(data); // In actuality, I'd do something with transformed data here
  std::cout << h_data[0] << std::endl;
  delete [] h_data;
  return 0;
}
$ nvcc -lineinfo -arch=sm_70 -O3 -o t1835 t1835.cu -std=c++14 --expt-extended-lambda
$ ./t1835
3
$
(CUDA 11.2, Centos 7, V100 GPU)
I suspect the reason that your posted code appears to compile and run correctly is that:
- The host execution path was actually selected
- The construction of your transform op was such that the read of the incoming data could be optimized out by the compiler (or, perhaps, equivalently, there is no read of the incoming data), thus no illegal access occurs.