(Update: timing experiments with clang++ and g++ are discussed at the end. Also, for simplicity, I use the exact body for comp_rt
in the question, demonstrating that it can optimize fully without us needing rewrite the body of the function.)
Yes, this can be done. But, g++ seems to do a lot of this stuff for you anyway without you realising it, see experiment at the end. With clang++ however, you really can see the runtime version being slower.
In the program below, all parameters, apart from X
, are passed as template args. Therefore, there will be a different comp_rt
template function built for every combination of parameters you use.
This could cause your binary to grow large if L
is large.
The way I've handed D[i]==0
might be difficult to understand at first.
I've placed it inside an enable_if
. There are two definitions of comp_tpl
here, one that is used when D[i]==0
and that is used when D[i]==1
. To be honest, this is probably unnecessary, I suspect the code would still be compiled optimally even if you just used the original body of the function inside a single comp_rt
function template. (I've removed this complication).
I've include a line like this inside the function:
using confirm_Ei_is_known_at_compile_time = array<char,E[i]>;
This confirms that E[i]
is known to the compiler at compile time. This is equivalent to a typedef, and the number of elements in an array
must be known at compile. If, for example, you attempted to use X[i]
instead of E[i]
as the size of the array
, the compiler would reject the code. Note: this line does nothing, it's just a sanity check at compile time.
Finally, given that E[i]
is known at compile time, the compiler is able to unroll the loops (if, in its wisdom, it feels that will speed it up). Be sure to turn on all optimizations - gcc has an option -funroll-all-loops
.
By passing the relevant parameters as template parameters, the compiler is able to do more optimizations. But I'm not sure that it will choose to do so! Experiments are required.
Here is the full program that I used for timing experiments.
#include<array>
#include<iostream>
using namespace std;
/*
* L is a positive integer
* D is vector of booleans of length L
* E is a vector of ints [0,L) of length L
* i will be in [0,L) also, therefore it is small enough that we can
* treat it as if it's known at compile time also
*
* The only thing that is *not* known at compile time is:
* X is a vector of ints of length L
*
* Therefore, our goal is something like:
*
* template<int L, int i, int D[L], int E[L]>
* int compute(int X[L]);
*/
template<int L, int i, const bool (&D)[L], const int (&E)[L]> // arrays passed, by reference, at compile-time
typename enable_if< D[i]==0 , int> :: type
comp_tpl(int (&)[L]) {
return 10;
}
template<int L, int i, const bool (&D)[L], const int (&E)[L]> // arrays passed, by reference, at compile-time
typename enable_if< D[i]==1 , int> :: type
comp_tpl(int (&X)[L]) {
int v = 0;
//using confirm_Ei_is_known_at_compile_time = array<char,E[i]>;
for (int j = 0; j < E[i]; ++j) // E[i] known at compile-time
v += X[j] * (j + 1); // X[j] known at run-time
return v;
}
template<int L, int i, const bool (&D)[L], const int (&E)[L]> // arrays passed, by reference, at compile-time
int
comp_tpl_simple(int (&X)[L]) {
if (D[i] == 0) // D[i] known at compile-time
return 10;
int v = 0;
using confirm_Ei_is_known_at_compile_time = array<char,E[i]>;
for (int j = 0; j < E[i]; ++j) // E[i] known at compile-time
v += X[j] * (j + 1); // X[j] known at run-time
return v;
}
template<int L> // arrays passed, by reference, at compile-time
int
comp_rt(int i, const bool (&D)[L], const int (&E)[L], int (&X)[L]) {
if (D[i] == 0) // D[i] known at compile-time
return 10;
int v = 0;
for (int j = 0; j < E[i]; ++j) // E[i] known at compile-time
v += X[j] * (j + 1); // X[j] known at run-time
return v;
}
constexpr int L = 5;
extern constexpr bool D[L] {0, 1, 1, 0, 1}; // Values in {0, 1}
extern constexpr int E[L] {1, 0, 3, 2, 4}; // Values in [0, L-1]
void change_X_arbitrarily(int (&X)[L]) {
for(int j=0; j<L; ++j)
++X[j];
}
int main() {
int X[L] {1, 3, 5, 7, 9}; // Any integer
#ifdef USE_RUNTIME
#define comp(L,i,D,E,X) comp_rt<L>(i,D,E,X)
#endif
#ifdef USE_TEMPLATE
#define comp(L,i,D,E,X) comp_tpl_simple<L,i,D,E>(X)
#endif
int total=0;
for(int outer_reps=0; outer_reps<10000; ++outer_reps) {
for(int inner_reps=0; inner_reps<100000; ++inner_reps) {
total += comp(L,0,D,E,X);
total += comp(L,1,D,E,X);
total += comp(L,2,D,E,X);
total += comp(L,3,D,E,X);
total += comp(L,4,D,E,X);
}
change_X_arbitrarily(X);
}
cout << total << endl; // should be 39798784
}
Note how I use a #define
to select which function to use. I compile and run:
$ clang++ SO.cpp -std=gnu++0x -O3 -DUSE_TEMPLATE -o SO && time -p ./SO
39798784 // the total value from all the calls, as a check
real 0.00
user 0.00
sys 0.00
It takes zero seconds to compute 1,000,000,000 times! But the runtime version takes 2.7 seconds
$ clang++ SO.cpp -std=gnu++0x -O3 -DUSE_RUNTIME -o SO && time -p ./SO
39798784 // the total value from all the calls, as a check
real 2.70
user 2.68
sys 0.00
I used clang3.3 with -O3
there.
When using g++ 4.8.2, I get a warning about undefined behaviour with -O3, but bizarrely the runtime is zero seconds with either the runtime or template version! Perhaps g++ is enabling the compile time tricks for us, even in 'runtime' mode. The lesson here is that compilers really can know much more about optimization than us!
Anyway, if I fall back to g++-4.8.2 -O2
then the runtime is 6.8 seconds in either case. Quite bizarre! Sometimes adding more O
slows it down!
An explanation: In this case, X
is actually known at compile-time. It's a local variable in this code, and is updated deterministically, so the compiler is able to fully predict it and will compute the answer at compile time! It appears g++ is doing this (very impressive!). Therefore, in my latest experiments, I moved X
outside of main
to be a global variable and now the optimizations behave 'as expected'. The comp_tpl
is consistently very much faster than comp_rt
now.