4

The code was written on the fly and changing name convention, so sorry if I made some mess. I'll rewrite here the question to make it clearer.

There is some data known at compile time, two arrays of integers D and E, both with length L. Each element of D is either zero or one. Each element of E contains a value in [0,L].

Then I've got a vector X which is known at run-time, also of length L.

I want to build a function that compute a certain value using D, E and X, for example:

int comp_rt(int i, array<int, L> X) {
    int v = 0;
    if (D[i] == 0) // D[i] known at compile-time
        return 10;
    for (int j = 0; j < E[i]; ++j) // E[i] known at compile-time
        v += X[j] * (j + 1); // X[j] known at run-time
    return v;
}

Since this computation is performed a lot of times, I want to reduce overheads and I thought that would be great to perform checks and loops over D and E at compile time.

Usually, to make it faster, instead of using the comp_rt function - which is the general case, I would write template specialized functions that, for each i, would just do the math. For example:

N = 5
D = [0, 1, 1, 0, 1] // Values in {0, 1}
E = [1, 0, 3, 2, 4] // Values in [0, L-1]
X = [1, 3, 5, 7, 9] // Any integer

template <int i> int comp(array<int, L> X);
template <> int comp_tpl<0>(array<int, L> X) { return 10; } // D[0] == 0
template <> int comp_tpl<1>(array<int, L> X) { return 0; } // E[1] == 0, skip loop
template <> int comp_tpl<2>(array<int, L> X) { return X[0] + 2 * X[1] + 3 * X[2]; }
template <> int comp_tpl<3>(array<int, L> X) { return 10; }
template <> int comp_tpl<4>(array<int, L> X) { return compl_tpl<2>(X) + 4 * X[3]; }

My question is: is it possible to use templates and/or constant expression to build functions at compile time using D and E, but perform as fast as comp_tpl? I mean to build something that "builds the expression to be computed at run time", and only computations involving X are left for the run time.

And, if it is possible, how it is done? Which general principles can be used to solve problems like that?

I tried using templates to do that, but resulting code is not as fast as comp_tpl... There are some recursive calls that I think are evaluated at run time.

AkiRoss
  • 11,745
  • 6
  • 59
  • 86
  • Maybe a simple pre-computed lookup table would be a good alternative here. – Fozi Nov 06 '13 at 21:52
  • Well, yes, but the issue is about how to solve this kind of problems. For example, a similar situation can be found when trying to solve differential equations: given a model (at compile time), differential equations can be built from it, and then a function to compute the value of the equations at different points (X) is called often. I am trying to generalize :) – AkiRoss Nov 06 '13 at 21:55
  • Is `comp_rt_final` called in a loop which iterates `i` from 0 to L-1 and sets `X[i]=comp_rt_final(i,X)`? If so, I guess that `comp_rt_final` and this loop are sufficient to exactly specify your algorithm. – Aaron McDaid Nov 07 '13 at 00:46
  • Actually, I have another comment. I suggest you rewrite your question to make it shorter. Also, I suggest you begin with a clear statement of the algorithm (for example, `comp_rt_final` and the loop that calls it). Also, I urge you not to get too distracted about compile-time optimizations - just state the algorithm clearly before asking about optimization. Then, after that, clarify which variables are known at compile time (`D`,`E`,`L`?) and which are the inputs known only at runtime (`X`?). Also, what's `N`? Finally, you should give a name to your output variable (perhaps `Y`?). – Aaron McDaid Nov 07 '13 at 00:58
  • 1
    You wrote (int j = 0; j < E[i]; ++i) It's ++j no ? – Davidbrcz Nov 07 '13 at 08:53
  • Yes, sorry, was j++! Typo – AkiRoss Nov 07 '13 at 10:42
  • @AaronMcDaid, I'm computing every `X[i]` here, but in reality it is not necessary so. And yes, `D`, `E`, `L` are compile time known, `X` is only runtime. `N` is wrong, it is supposed to be `L` of course since `E` contains indices for `X`. – AkiRoss Nov 07 '13 at 10:44
  • @AaronMcDaid question rewritten. Indeed, it was too messy and my choice for identifiers was unhappy. – AkiRoss Nov 07 '13 at 11:00
  • I've made an edit, but I realised there's something I need confirmation of. @AkiRoss, I assume the elements of `E` are between 0 and L-1 (inclusive)? i.e. the *can* equal zero, but *cannot* equal L? I've changed the question to say `[0,L-1]`. Apologies if I was incorrect in this. – Aaron McDaid Nov 07 '13 at 11:41
  • Where is the recursion? In `comp_rt` there is no recursion. Perhaps your last sentence should read: "I tried using templates to do that, but resulting code is not as fast as *comp_rt*... There are some recursive calls *in the comp_tpl code* that I think are evaluated at run time." (In other words, recursion isn't a fundamental part of the calculation you're trying to implement? It's just (potentially) a part of a possible solution?) – Aaron McDaid Nov 07 '13 at 11:43
  • ... continuing my last comment ... There is no recursion on `comp_rt`, but there is recursion in `comp_tpl` (where <4> calls <2>). I don't understand why. Perhaps you could say that `comp_rt` is the definition of the problem? And that `comp_tpl` is *not* intended to contradict the problem definition given in `comp_rt`? – Aaron McDaid Nov 07 '13 at 11:49
  • @AaronMcDaid `E` can have values in `[0,L]` because it can loop over none or all elements in `X`, so `L` is inclusive. The recursive calls are obviously in my template solution, yes. – AkiRoss Nov 07 '13 at 12:31
  • @AaronMcDaid, In my opinion you're sticking too much on this *particular* instance, that I am taking as example, but my question is much more general. – AkiRoss Nov 07 '13 at 12:34
  • In the question, you haven't said what D and E actually are. `int[]` or `std::array`? `const` or `constexpr`, or neither? – Aaron McDaid Nov 12 '13 at 16:20
  • I am using `constexpr std::array`. – AkiRoss Nov 12 '13 at 16:37

4 Answers4

2

Edit: Updated according to clarifications in question:
Edit2: Removed the Conditional.

This calculates the sum much like before; It's tail-recursive*:

template<class T, size_t Length> struct Sum {
    template<class Array>
    static T comp(const Array &x, T add = 0) {
        return Sum<T, Length - 1>::comp(x, add + Length * x[Length - 1]);
    }
};

template<class T> struct Sum<T, 0> {
    template<class Array>
    static T comp(const Array &x, T add = 0) {
        return add;
    }
};

This is the part that pulls it together and depends on d and e. You probably could parameterize them but I think it's more trouble than worth it.

constexpr int d[] = { 0, 1, 1, 0, 1 };
constexpr int e[] = { 1, 0, 3, 2, 4 };

template<int N> struct Comp {
    template<class Array>
    static int comp(const Array &x) {
        return d[N] ? Sum<int, e[N]>::comp(x) : 10;
    }
};

Usage:

int x[] = { 1, 3, 5, 7, 9 };
Comp<3>::comp(x);

http://ideone.com/PmFBhU

(*) Well not really, but close enough.

Fozi
  • 4,973
  • 1
  • 32
  • 56
  • Uhm, no :P Your code just computes x[0] + 2 * x[1] + 3 * x[2]..., but in my code the value for that sequence is computed depends on two vectors `D` and `E`. Sorry, maybe my code was not clear as I did not used the dependency on `D` in `comp_rt`, but `D[i]` and `E[i]` still must be evaluated to build the expression. – AkiRoss Nov 06 '13 at 23:49
  • Nice, I like this one :) Did you test its speed? – AkiRoss Nov 07 '13 at 18:13
  • @AkiRoss No, except for ideone saying it ran in 0 seconds ;) Usually code like this will be optimized fairly well and runs as fast as you can get, but it does rely on the tail call optimization that can be fickle on the compiler actually applying them (in my experience). – Fozi Nov 07 '13 at 22:00
  • Ok, very good. I'll try asap to check if tail-optimization improves performances. I notice my tentative solution is more complex than your as I do not use `constexpr` so effectively, and there is no tail optimization. Thanks! – AkiRoss Nov 07 '13 at 22:37
  • Finally I had time to try in this way, and actually I'm getting very good results, but I need to turn on compiler optimization or the code will not be as fast as the hand-crafted equivalent. – AkiRoss Nov 12 '13 at 15:42
1

(Update: timing experiments with clang++ and g++ are discussed at the end. Also, for simplicity, I use the exact body for comp_rt in the question, demonstrating that it can optimize fully without us needing rewrite the body of the function.)

Yes, this can be done. But, g++ seems to do a lot of this stuff for you anyway without you realising it, see experiment at the end. With clang++ however, you really can see the runtime version being slower.

In the program below, all parameters, apart from X, are passed as template args. Therefore, there will be a different comp_rt template function built for every combination of parameters you use. This could cause your binary to grow large if L is large.

The way I've handed D[i]==0 might be difficult to understand at first. I've placed it inside an enable_if. There are two definitions of comp_tpl here, one that is used when D[i]==0 and that is used when D[i]==1. To be honest, this is probably unnecessary, I suspect the code would still be compiled optimally even if you just used the original body of the function inside a single comp_rt function template. (I've removed this complication).

I've include a line like this inside the function:

    using confirm_Ei_is_known_at_compile_time = array<char,E[i]>;

This confirms that E[i] is known to the compiler at compile time. This is equivalent to a typedef, and the number of elements in an array must be known at compile. If, for example, you attempted to use X[i] instead of E[i] as the size of the array, the compiler would reject the code. Note: this line does nothing, it's just a sanity check at compile time.

Finally, given that E[i] is known at compile time, the compiler is able to unroll the loops (if, in its wisdom, it feels that will speed it up). Be sure to turn on all optimizations - gcc has an option -funroll-all-loops.

By passing the relevant parameters as template parameters, the compiler is able to do more optimizations. But I'm not sure that it will choose to do so! Experiments are required.


Here is the full program that I used for timing experiments.

#include<array>
#include<iostream>
using namespace std;

/*
 * L is a positive integer
 * D is vector of booleans of length L
 * E is a vector of ints [0,L) of length L
 * i will be in [0,L) also, therefore it is small enough that we can
 *         treat it as if it's known at compile time also
 *
 * The only thing that is *not* known at compile time is:
 * X is a vector of ints of length L
 *
 * Therefore, our goal is something like:
 *
 *   template<int L, int i, int D[L], int E[L]>
 *   int compute(int X[L]);
 */

template<int L, int i, const bool (&D)[L], const int (&E)[L]> // arrays passed, by reference, at compile-time
typename enable_if< D[i]==0 , int> :: type
comp_tpl(int (&)[L]) {
        return 10;
}
template<int L, int i, const bool (&D)[L], const int (&E)[L]> // arrays passed, by reference, at compile-time
typename enable_if< D[i]==1 , int> :: type
comp_tpl(int (&X)[L]) {
    int v = 0;
    //using confirm_Ei_is_known_at_compile_time = array<char,E[i]>;
    for (int j = 0; j < E[i]; ++j) // E[i] known at compile-time
        v += X[j] * (j + 1); // X[j] known at run-time
    return v;
}

template<int L, int i, const bool (&D)[L], const int (&E)[L]> // arrays passed, by reference, at compile-time
int
comp_tpl_simple(int (&X)[L]) {
    if (D[i] == 0) // D[i] known at compile-time
        return 10;
    int v = 0;
    using confirm_Ei_is_known_at_compile_time = array<char,E[i]>;
    for (int j = 0; j < E[i]; ++j) // E[i] known at compile-time
        v += X[j] * (j + 1); // X[j] known at run-time
    return v;
}

template<int L> // arrays passed, by reference, at compile-time
int
comp_rt(int i, const bool (&D)[L], const int (&E)[L], int (&X)[L]) {
    if (D[i] == 0) // D[i] known at compile-time
        return 10;
    int v = 0;
    for (int j = 0; j < E[i]; ++j) // E[i] known at compile-time
        v += X[j] * (j + 1); // X[j] known at run-time
    return v;
}


constexpr int L = 5;
extern constexpr bool D[L] {0, 1, 1, 0, 1};  // Values in {0, 1}
extern constexpr int  E[L] {1, 0, 3, 2, 4}; // Values in [0, L-1]

void change_X_arbitrarily(int (&X)[L]) {
    for(int j=0; j<L; ++j)
        ++X[j];
}

int main() {
    int X[L] {1, 3, 5, 7, 9}; // Any integer

#ifdef USE_RUNTIME
    #define comp(L,i,D,E,X) comp_rt<L>(i,D,E,X)
#endif
#ifdef USE_TEMPLATE
    #define comp(L,i,D,E,X) comp_tpl_simple<L,i,D,E>(X)
#endif

    int total=0;
    for(int outer_reps=0; outer_reps<10000; ++outer_reps) {
        for(int inner_reps=0; inner_reps<100000; ++inner_reps) {
            total += comp(L,0,D,E,X);
            total += comp(L,1,D,E,X);
            total += comp(L,2,D,E,X);
            total += comp(L,3,D,E,X);
            total += comp(L,4,D,E,X);
        }
        change_X_arbitrarily(X);
    }

    cout << total << endl; // should be 39798784
}

Note how I use a #define to select which function to use. I compile and run:

$ clang++ SO.cpp -std=gnu++0x -O3 -DUSE_TEMPLATE -o SO && time -p ./SO
39798784  // the total value from all the calls, as a check
real 0.00
user 0.00
sys 0.00

It takes zero seconds to compute 1,000,000,000 times! But the runtime version takes 2.7 seconds

$ clang++ SO.cpp -std=gnu++0x -O3 -DUSE_RUNTIME -o SO && time -p ./SO
39798784  // the total value from all the calls, as a check
real 2.70
user 2.68
sys 0.00

I used clang3.3 with -O3 there.

When using g++ 4.8.2, I get a warning about undefined behaviour with -O3, but bizarrely the runtime is zero seconds with either the runtime or template version! Perhaps g++ is enabling the compile time tricks for us, even in 'runtime' mode. The lesson here is that compilers really can know much more about optimization than us!

Anyway, if I fall back to g++-4.8.2 -O2 then the runtime is 6.8 seconds in either case. Quite bizarre! Sometimes adding more O slows it down!

An explanation: In this case, X is actually known at compile-time. It's a local variable in this code, and is updated deterministically, so the compiler is able to fully predict it and will compute the answer at compile time! It appears g++ is doing this (very impressive!). Therefore, in my latest experiments, I moved X outside of main to be a global variable and now the optimizations behave 'as expected'. The comp_tpl is consistently very much faster than comp_rt now.

Aaron McDaid
  • 26,501
  • 9
  • 66
  • 88
  • Uhm interesting, my gcc (4.8.2) does not do this kind of optimization, and the function is computed anyway at runtime even if `X` does not change! – AkiRoss Nov 12 '13 at 16:02
  • I've just installed g++ 4.8.2 also. I've noticed that simply removing `const` from `const int E[L] = ...` changes the runtime from 0 seconds to 11 seconds. (Using comp_rt and -O3 in both cases) This emphasisizes the importance of how D and E are declared, in particular E. `constexpr` would be the clearest hint to the compiler. And also it shows that there are a lot of different issues interacting here. Perhaps you cannot show us all your code, but could you show us the declarations? The declaration of the function you're testing and the declaration of the D and E variables? – Aaron McDaid Nov 12 '13 at 16:19
1

(Sorry to add another answer)

(I'll put sample code at the end)

My experiments, and further thought, have convinced me that that original code can be used, as is, with only minor modifications. The compilers are so good at doing the optimization, that I find it difficult to slow it down sometimes! Only by marking X volatile, or constantly editing it with random data from rand(), can I really make the "runtime" version slow.

First, if you have only one D vector and only one E vector then you simply need to put constexpr in front of the array declarations.

constexpr int D[] = { 0, 1, 1, 0, 1 };
constexpr int E[] = { 1, 0, 3, 2, 4 };

(If you have multiple such vectors, and you want to prepare 'pre-partially-compiled' functions for each of them, we could pass them via template parameters, as discussed in my other (long-winded) answer.)

We also need to deal with the i index in the original function: int comp_rt(int i, array<int, L> X);. It should be a template parameter:

template<size_t i>
int comp_rt(array<int, L> X);

The body of the function does not need to be changed. The compiler now knows that i, D[i], and E[i] are constant expressions. The expressions involving D[i] and E[i] are replaced by their constant values. The test if(D[i]==0) is replaced, at compile-time, by if (true) or if (false) as necessary. Also, the loops will be unrolled because the compiler knows exactly the value of E[i]. The loop is unrolled and the compiler can see that v is simply a long sum. In this case, it will replace it with the explicit sum, remove all the zero terms, and add up all the constant terms, and so on. All of this done at compile-time. There is pretty much nothing more we can do to help the compiler here. This is equivalent to some of the more complex template solutions that could be used.

Used g++ and clang++ with -O2 and -O3.

In some of my experiments, gcc's program ran in zero seconds, no matter how many iterations I required. The is because the algorithm is deterministic and gcc could compute, in advance, everything that would happen to X (even though I was changing X regularly!). In that case, part of the problem was that I had made X a local variable, but the lesson is that the compiler can see a deterministic program and compute everything in advance, even if you don't want it to! clang doesn't seem to optimize as aggressively here.

If you have some code that is slower than you hoped, and you can put together a complete piece of code demonstrating the slow code, then perhaps we could suggest other small changes. But I believe the simple use of constexpr on data, and a template parameter for i, will do the trick.


In this example code, I've made another change. I'm indirectly using tuple_size<array<char, D[i]> > :: value instead of simply D_i, this doesn't really change the meaning, but I think it will encourage older compilers to evaluate at compile time. My goal in this is to match the original readability of the code as much as possible, keeping this entire function together in one place for example instead of splitting it into many templates.

constexpr int L   = 5;
constexpr int D[] = { 0, 1, 1, 0, 1 };
constexpr int E[] = { 1, 0, 3, 2, 4 };
template<int i>
int comp_rt(array<int, L> X) {
    using D_i_type = array<char, D[i]>;
    int v = 0;
    if (tuple_size<D_i_type>::value == 0) // D[i] known at compile-time
        return 10;
    using E_i_type = array<char, E[i]>;
    for (int j = 0; j < tuple_size<E_i_type>::value; ++j) { // E[i] known at compile-time
        v += X[j] * (j + 1); // X[j] known at run-time
    }
    return v;
}
Aaron McDaid
  • 26,501
  • 9
  • 66
  • 88
  • Well, actually in my program the `X` value has to be altered in every loop, so the function has to be called every time on different values, and using my handwritten template specialization is about 4-5 faster than using the plain function (with constant values). I tried using g++ -O2 and -Os but without significant improvements. Thanks for this second answer ;) – AkiRoss Nov 10 '13 at 17:48
  • Did you put `constexpr` if from of `D` and `E`, and are `D` and `E` arrays (`[]`)? And `i` is a `template` parameter? It's interesting if your g++ isn't optimizing this. Which version of g++? – Aaron McDaid Nov 10 '13 at 18:04
  • Not sure if I just used `const` or `constexpr` (sorry but I can't find the test code), but I think that probably it did not optimize the loops and maybe it did not optimize the multiplication by zero, so I guess I did not use `constexpr` to declare the arrays, but just `const`. – AkiRoss Nov 10 '13 at 18:12
  • I've added some code at the end of my answer. In this case, I've made another change that should encourage the compiler to calculate things at compile time. It's possible that a lazy (old?) compiler is unable/unwilling to fully compute constexprs, even when it can do so. By declaring an `array` type with `D[i]` and `E[i]` members respectively, it then must calculated at compile time. If you put some negative numbers into D and E, you'll see compiler errors and this confirms that the compiler is making some progress at compile time. – Aaron McDaid Nov 10 '13 at 18:26
-1

With constexpr functions, it can be done at compil-time

#include <iostream>
#include <array>
#include <utility>

#include <cstddef>
#include <type_traits>

  /// A type that represents a parameter pack of zero or more integers.
  template<typename T, T... I>
    struct integer_sequence
    {
      static_assert( std::is_integral<T>::value, "Integral type" );

      using type = T;

      static constexpr T size = sizeof...(I);

      /// Generate an integer_sequence with an additional element.
      template<T N>
        using append = integer_sequence<T, I..., N>;

      using next = append<size>;
    };

  template<typename T, T... I>
    constexpr T integer_sequence<T, I...>::size;

  template<std::size_t... I>
    using index_sequence = integer_sequence<std::size_t, I...>;

  namespace detail
  {
    // Metafunction that generates an integer_sequence of T containing [0, N)
    template<typename T, T Nt, std::size_t N>
      struct iota
      {
        static_assert( Nt >= 0, "N cannot be negative" );

        using type = typename iota<T, Nt-1, N-1>::type::next;
      };

    // Terminal case of the recursive metafunction.
    template<typename T, T Nt>
      struct iota<T, Nt, 0ul>
      {
        using type = integer_sequence<T>;
      };
  }


  // make_integer_sequence<T, N> is an alias for integer_sequence<T, 0,...N-1>
  template<typename T, T N>
    using make_integer_sequence = typename detail::iota<T, N, N>::type;

  template<int N>
    using make_index_sequence = make_integer_sequence<std::size_t, N>;


  // index_sequence_for<A, B, C> is an alias for index_sequence<0, 1, 2>
  template<typename... Args>
    using index_sequence_for = make_index_sequence<sizeof...(Args)>;
//--------------------My part starts here-------------------------------------------------------
template <size_t N> constexpr int computebis(int bound,std::array<int,N> X,int j)
{
  return (j<bound) ? X[j]*(j+1) + computebis(bound,X,j+1) : 0;
}

template <size_t N> constexpr int compute2(std::array<int,N> D,
                                            std::array<int,N> E,
                                            std::array<int,N> X,int index)
{
  return (D[index]==0) ? 10 : computebis(E[index],X,0);
}


template <size_t N,std::size_t... Indices> constexpr std::array<int,N> mfill(std::array<int,N> D,
                                                                            std::array<int,N> E,
                                                                            std::array<int,N> X,
                                                                            index_sequence<Indices...>)
{
  return {{ compute2(D,E,X,Indices)... }};
}

template <size_t N> constexpr std::array<int,N> mfill(std::array<int,N> D,std::array<int,N> E,std::array<int,N> X)
{
  return mfill(D,E,X,make_index_sequence<N>{});
}


int main(int argc, char *argv[])
{

  std::array<int,5> D= {0,1,1,0,1};
  std::array<int,5> E= {1,0,3,2,4};
  std::array<int,5> X= {1,3,5,7,9};
  //to be sure that it is done at compil time
  const auto X2 =  mfill(D,E,X);

  for(auto e:X2){
    std::cout<<e<<std::endl;
  }

Edit : code updated Inspired by Create N-element constexpr array in C++11 and I took all the first part there

Community
  • 1
  • 1
Davidbrcz
  • 2,335
  • 18
  • 27
  • Interesting, but not quite what I was looking for. See the updated question for details. – AkiRoss Nov 07 '13 at 00:11
  • 1
    But... Having `X` as parameter in a constexpr, doesn't it make required at compile-time? `X` is known at runtime and will change often. – AkiRoss Nov 07 '13 at 11:01
  • Marking a function as `constexpr` does not force it to be executed at compile time. It's little more than a hint to the compiler that allows it, if *all* parameters are known at compile time, to compute the answer at compile time. To confirm that this answer doesn't compute a constant expression, simply change `const auto X2 = mfill(D,E,X);` to `constexpr auto X2 = mfill(D,E,X);`, you'll get errors. For example, *"note: read of non-constexpr variable 'D' is not allowed in a constant expression"*. – Aaron McDaid Nov 08 '13 at 23:56
  • ... I also put `constexpr` in front of `D`, `E` and `X`, and then got this error "*non-constexpr function 'operator[]' cannot be used in a constant expression*". And, anyway, we can't have `X` as a constexpr in this question. – Aaron McDaid Nov 09 '13 at 00:00