
I have a float array that needs to be referenced many times on the device, so I believe the best place to store it is in `__constant__` memory (using this reference). The array (or vector) needs to be written once at run-time during initialization, but read by multiple different functions many millions of times, so copying it to the kernel on each function call seems like A Bad Idea.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

const int n = 32;
__constant__ float dev_x[n]; //the array in question

struct struct_max : public thrust::unary_function<float,float> {
    float C;
    struct_max(float _C) : C(_C) {}
    __host__ __device__ float operator()(const float& x) const { return fmax(x,C);}
};
void foo(const thrust::host_vector<float> &, const float &);

int main() {
    thrust::host_vector<float> x(n);
    //magic happens populate x
    cudaMemcpyToSymbol(dev_x,x.data(),n*sizeof(float));

    foo(x,0.0);
    return(0);
}

void foo(const thrust::host_vector<float> &input_host_x, const float &x0) {
    thrust::device_vector<float> dev_sol(n);
    thrust::host_vector<float> host_sol(n);

    //this method works fine, but the memory transfer is unacceptable
    thrust::device_vector<float> input_dev_vec(n);
    input_dev_vec = input_host_x; //I want to avoid this
    thrust::transform(input_dev_vec.begin(),input_dev_vec.end(),dev_sol.begin(),struct_max(x0));
    host_sol = dev_sol; //this memory transfer for debugging

    //this method compiles fine, but crashes at runtime
    thrust::device_ptr<float> dev_ptr = thrust::device_pointer_cast(dev_x);
    thrust::transform(dev_ptr,dev_ptr+n,dev_sol.begin(),struct_max(x0));
    host_sol = dev_sol; //this line crashes
}

I tried adding a global `thrust::device_vector<float> dev_x(n)`, but that also crashed at run-time, and it would live in global memory rather than `__constant__` memory.

This can all be made to work if I just discard the thrust library, but is there a way to use the thrust library with globals and device constant memory?

user2462730
1 Answer


Good question! You can't cast a __constant__ array as if it's a regular device pointer.

I will answer your question (after the line below), but first: this is a bad use of __constant__, and it isn't really what you want. The constant cache in CUDA is optimized for uniform access across threads in a warp. That means all threads in the warp access the same location at the same time. If each thread of the warp accesses a different constant memory location, then the accesses get serialized. So your access pattern, where consecutive threads access consecutive memory locations, will be 32 times slower than a uniform access. You should really just use device memory. If you need to write the data once, but read it many times, then just use a device_vector: initialize it once, and then read it many times.
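For example, here is a minimal sketch of that approach (the helper name `run_transforms` is made up for illustration; `struct_max` is your existing functor that takes a float):

#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Sketch only: allocate and fill dev_x once, then read it many times.
// No host-to-device copy happens per call; dev_x stays resident on the device.
void run_transforms(const thrust::device_vector<float> &dev_x, float x0,
                    thrust::device_vector<float> &dev_sol) {
    thrust::transform(dev_x.begin(), dev_x.end(), dev_sol.begin(), struct_max(x0));
}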


---

To do what you asked, you can use a thrust::counting_iterator as the input to thrust::transform to generate a range of indices into your __constant__ array. Then your functor's operator() takes an int index operand rather than a float value operand, and does the lookup into constant memory.

(Note that this means your functor is now __device__ code only. You could easily overload the operator to take a float and call it differently on host data if you need portability.)
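For instance, the overloaded functor could look like this (a sketch, assuming the same `__constant__` dev_x declaration as in the full example below):

struct struct_max : public thrust::unary_function<int,float> {
    float C;
    struct_max(float _C) : C(_C) {}

    // device-only: index into the __constant__ array
    __device__ float operator()(const int& i) const { return fmax(dev_x[i],C); }

    // host or device: operate on a value directly
    __host__ __device__ float operator()(const float& x) const { return fmax(x,C); }
};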

I modified your example to initialize the data and print the result to verify that it is correct.

#include <stdio.h>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>

const int n = 32;
__constant__ float dev_x[n]; //the array in question

struct struct_max : public thrust::unary_function<int,float> {
    float C;
    struct_max(float _C) : C(_C) {}

    // only works as a device function
    __device__ float operator()(const int& i) const { 
        // use index into constant array
        return fmax(dev_x[i],C); 
    }
};

void foo(const thrust::host_vector<float> &input_host_x, const float &x0) {
    thrust::device_vector<float> dev_sol(n);
    thrust::host_vector<float> host_sol(n);

    thrust::transform(thrust::make_counting_iterator(0),
                      thrust::make_counting_iterator(n),
                      dev_sol.begin(),
                      struct_max(x0));
    host_sol = dev_sol; // copy the result back to the host for printing

    for (int i = 0; i < n; i++)
        printf("%f\n", host_sol[i]);
}

int main() {
    thrust::host_vector<float> x(n);

    //magic happens populate x
    for (int i = 0; i < n; i++) x[i] = rand() / (float)RAND_MAX;

    cudaMemcpyToSymbol(dev_x,x.data(),n*sizeof(float));

    foo(x, 0.5);
    return(0);
}
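For reference, a file containing this example should build with nvcc directly (the file name here is made up):

nvcc -O2 -o constant_example constant_example.cu
./constant_example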
harrism
  • thanks for your help! The vector will be a power of 2 elements long, probably >=8096, so I'll drop the idea of using `__constant__` memory – user2462730 Jun 13 '13 at 06:30
  • If I change to a global device_vector and reference that, I get a crash at run-time (well, at debug run-time). Can I add a global device_vector, or does it need to be declared in main() and passed by reference? – user2462730 Jun 13 '13 at 06:33
  • Power of 2 or size is not the reason not to use `__constant__` here -- it's as I said: yours is not the type of memory access pattern for which `__constant__` is optimized. Regarding your crash: why make it a global? The problem I see with making it global is that you would not be able to create the array with a size determined at runtime, because the constructor would be called before main(). There are also tricky issues with the order of construction of globals across compilation units. Generally I would create it in a function and pass it by reference. – harrism Jun 13 '13 at 10:56
  • @harrism Please, can you clarify? You said: "The constant cache in CUDA is optimized for uniform access across threads in a warp. That means all threads in the warp access the same location at the same time." - does it mean that if I use random access to `__constant__` memory then it will not have any advantage compared with the global memory allocated by `cudaMalloc()`? But how can I speed up memory access in this case, should I use `LDG load`? http://on-demand.gputechconf.com/gtc/2013/presentations/S3011-CUDA-Optimization-With-Nsight-VSE.pdf – Alex Apr 20 '14 at 19:45