compile constant memory array to immediate value in CUDA

Question

I am writing a code to approximate a function using power series and would like to exploit #pragma unroll and FMA instruction, like this:

__constant__ double coeff[5] = {1.0,2.0,3.0,4.0,5.0}; /* constant is fake here */

__device__ double some_function(double x) {
  double y;
  int i;
  y = coeff[0];
#pragma unroll
  for(i=1;i<5;i++) y = y*x + coeff[i];
  return y;
}

The code will be compiled into assembly like this:

ld.const.f64    %fd33, [coeff];
ld.const.f64    %fd34, [coeff+8];
fma.rn.f64      %fd35, %fd33, %fd32, %fd34;
ld.const.f64    %fd36, [coeff+16];
fma.rn.f64      %fd37, %fd35, %fd32, %fd36;
ld.const.f64    %fd38, [coeff+24];
fma.rn.f64      %fd39, %fd37, %fd32, %fd38;
ld.const.f64    %fd40, [coeff+32];
fma.rn.f64      %fd41, %fd39, %fd32, %fd40;

I want to avoid the use of constant memory and use immediate value like this:

mov.f64         %fd248, 0d3ED0EE258B7A8B04;
mov.f64         %fd249, 0d3EB1380B3AE80F1E;
fma.rn.f64      %fd250, %fd249, %fd247, %fd248;
mov.f64         %fd251, 0d3EF3B2669F02676F;
fma.rn.f64      %fd252, %fd250, %fd247, %fd251;
mov.f64         %fd253, 0d3F1745CBA9AB0956;
fma.rn.f64      %fd254, %fd252, %fd247, %fd253;
mov.f64         %fd255, 0d3F3C71C72D1B5154;
fma.rn.f64      %fd256, %fd254, %fd247, %fd255;
mov.f64         %fd257, 0d3F624924923BE72D;
fma.rn.f64      %fd258, %fd256, %fd247, %fd257;
mov.f64         %fd259, 0d3F8999999999A3C4;
fma.rn.f64      %fd260, %fd258, %fd247, %fd259;
mov.f64         %fd261, 0d3FB5555555555554;
fma.rn.f64      %fd262, %fd260, %fd247, %fd261;

I know that I can use #define macro to do so, but it is very inconvenient when there are many coefficients.

Are there any C data type modifier (or compiler options) that could convert my coefficients array into immediate values instead of using constant memory?

I tried and it does not work for static double, static __constant__ double and static __device__ double.

and my final question is: I guess using immediate value should be faster than using constant memory?

What is important for performance is the final machine code (SASS), not the intermediate representation (PTX). Depending on your target architecture, the FMA instruction can reference the constant memory directly without separate load instructions and this is as fast as it will get (uniform access to constant memory is practically as fast as register access). Your code compiled for sm_35 results in this sequence: `DFMA R2, R4, c[0x3][0x0], R2; DFMA R2, R2, R4, c[0x3][0x10]; DFMA R2, R2, R4, c[0x3][0x18]; DFMA R4, R2, R4, c[0x3][0x20];` — njuffa, Jan 20 '15 at 18:11
Even if the performance is the same, getting rid of it will also saved my constant memory space (only 64K on my display card) — wh0, Jan 21 '15 at 02:26

score 4 · Accepted Answer · edited May 23 '17 at 11:53

Ok what you are trying to do is not possible (at least not with CUDA) in the way you try it and that is because CUDA forbids declarations of static const arrays in global scope. CUDA demands that each global array is assigned to a specific address space (__device__, __contant__, etc.).

However with some inconvenience it is possible.

I assembled some SO answers:

C++11: Compile Time Calculation of Array

Is it possible to develop static for loop in c++?

, please honor the work over there, and added some CUDA.

Here you are:

What you want is that the compiler to do the dirty work for you and therefore you have to make all and everything evaluated at compile time:

At first we need a static array where we can store the coefficients:

template <unsigned int index, long long... remPack> struct getValue;

template <unsigned int index, long long In, long long... remPack>
struct getValue<index, In, remPack...> {
  static const long long value = getValue<index - 1, remPack...>::value;
};

template <long long In, long long... remPack>
struct getValue<1, In, remPack...> {
  static const long long value = In;
};

template <long long... T> struct static_array {
  template <int idx> static __device__ int get() { return getValue<idx, T...>::value; }
};

This static_array stores values in the C++ type system as long long. I'll come back to that later in the answer.

Next thing is the for loop that must be unrolled. Again wie use template meta programming for that:

template <int First, int Last, template <int> class Fn> struct static_for {
  __device__ double operator()(double x, double y) const {
      return static_for<First + 1, Last, Fn>()(x, Fn<First + 1>()(x, y));
  }
};

template <int N, template <int> class Fn> struct static_for<N, N, Fn> {
  __device__ double operator()(double x, double y) const { return y; }
};

Since we do all static at compile time it is necessary to move the input and the output of each "loop trip" through the parameters and the return expression of the operator().

This solution is very static and with more template meta programming you possible can improve it.

Ok now the interesting part. The computations:

template <int i> struct Function {
  __device__ double operator()(double x, double y) {
    double c = __longlong_as_double(static_array<12, 34, 22, 55, 24>::get<i>());
    return y * x + c;
  }
};

__device__ double some_function(double x) {
  return static_for<0, 5, Function>()(x, 0.0);
}

The C++ type system allows only integer types as non-type template parameters so we have to store our doubles in long longs and then use CUDA's __longlong_as_double() function to transform them back. This is something we have to accept at this point and may be a deal breaker for you because it is not "simple" anymore. However, a double to long long converter shouldn't be that hard to write.

The entire computation is wrap in a functor object that gets the trip counter from our static_loop as template argument. With this compile time "trip counter" we can access the static_array transform the long long version of the double back and compute the FMA.

Thanks to the CUDA compiler (which does a realy good job here) this is the PTX code (nvcc -ptx -arch=sm_35 test.cu) I used the 7.0 RC1 version:

.visible .func  (.param .b64 func_retval0) _Z13some_functiond(
        .param .b64 _Z13some_functiond_param_0
)
{
        .reg .f64       %fd<7>;

        ld.param.f64    %fd1, [_Z13some_functiond_param_0];
        fma.rn.f64      %fd2, %fd1, 0d0000000000000000, 0d000000000000000C;
        fma.rn.f64      %fd3, %fd2, %fd1, 0d0000000000000022;
        fma.rn.f64      %fd4, %fd3, %fd1, 0d0000000000000016;
        fma.rn.f64      %fd5, %fd4, %fd1, 0d0000000000000037;
        fma.rn.f64      %fd6, %fd5, %fd1, 0d0000000000000018;
        st.param.f64    [func_retval0+0], %fd6;
        ret;
}

Thank you very much for your efforts. I think it should work. But I know nothing about C++ meta-programming. Originally I think it should work by adding some compiler options. Given the complexity of this method. I will try to write 20-30 lines of __fma_rn(...) instead. — wh0, Jan 21 '15 at 02:25

Soonts · Answer 2 · 2018-04-28T23:33:08.457

At least in Cuda 8, local constexpr arrays work OK, i.e. for unrolled loops the *.ptx contains immediate values, not memory references. Example (untested):

#define COEFF_VALUES { 1.0, 2.0, 3.0, 4.0, 5.0 }

__device__ double some_function( double x )
{
    constexpr double coeff[ 5 ] = COEFF_VALUES;
    double y;
    int i;
    y = coeff[ 0 ];
#pragma unroll
    for( i = 1; i < 5; i++ ) y = y*x + coeff[ i ];
    return y;
}

Compiles to code like this:

add.f64     %fd2, %fd1, 0d4000000000000000;
fma.rn.f64  %fd3, %fd1, %fd2, 0d4008000000000000;
fma.rn.f64  %fd4, %fd1, %fd3, 0d4010000000000000;
fma.rn.f64  %fd5, %fd1, %fd4, 0d4014000000000000;

compile constant memory array to immediate value in CUDA

2 Answers2