The fact is that `std::linear_congruential_engine<UIntType, a, c, m>::discard(unsigned long long z)` can definitely be implemented very efficiently (and it is quite trivial, actually). It is mostly equivalent to raising `a` to the power `z` modulo `m` (for both zero and non-zero `c`), which means a quite simple software implementation executes in `O(log(z % φ(m)))` `UIntType` multiplication-plus-modulo-reduction operations (where `φ(m) = m - 1` for prime `m`, and `φ(m) < m` in the general case).
^^^NOTE that in a sense `O(log(z % φ(m)))` is `O(1)`, because `log2(z % φ(m)) < log2(m) <= sizeof(UIntType)*CHAR_BIT`; in practice it behaves more like `O(log(z))`.
^^^ALSO NOTE that after generating some template-parameter-dependent precomputed tables (either lazy or compile-time computation may be employed, whichever is suitable), the exponentiation cost can be reduced to just a few multiplication-plus-modulo-reduction operations (like 4 or 8), i.e. `O(1)` in every practical sense.
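To make the claim concrete, here is a minimal sketch of an `O(log z)` LCG jump-ahead. This is not the standard library's implementation, just an illustration; it assumes a GCC/Clang-style compiler with the `unsigned __int128` extension, and parameters that fit in 64 bits:

```cpp
#include <cassert>
#include <cstdint>

// Jump an LCG z steps ahead in O(log z) multiplications instead of z steps.
// One step is the affine map x -> (a*x + c) mod m, and composing two such
// maps yields another one: (B1,D1) after (B2,D2) = (B1*B2, B1*D2 + D1).
// So square-and-multiply on the (multiplier, increment) pair does the job.
std::uint64_t lcg_skip(std::uint64_t x, unsigned long long z,
                       std::uint64_t a, std::uint64_t c, std::uint64_t m)
{
    std::uint64_t A = 1, C = 0;  // accumulated map, starts as identity x -> x
    std::uint64_t B = a, D = c;  // map for the current power-of-two step count
    while (z != 0) {
        if (z & 1) {  // fold this power of two into the accumulated map
            A = static_cast<std::uint64_t>((unsigned __int128)B * A % m);
            C = static_cast<std::uint64_t>(((unsigned __int128)B * C + D) % m);
        }
        D = static_cast<std::uint64_t>(((unsigned __int128)B * D + D) % m);
        B = static_cast<std::uint64_t>((unsigned __int128)B * B % m);
        z >>= 1;
    }
    return static_cast<std::uint64_t>(((unsigned __int128)A * x + C) % m);
}
```

The trick is that the affine maps `x -> (a*x + c) mod m` form a monoid under composition, so the usual square-and-multiply applies to the `(multiplier, increment)` pair just as it does to plain modular exponentiation.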
Probably there are also efficient enough algorithms for most other engines' `discard` functions that would satisfy an `O(P(sizeof(state)) * log(z))` constraint (`P(x)` being some low-degree polynomial, most likely of degree `1+o(1)` or at most `3+o(1)`, given that `log(z) < sizeof(unsigned long long)*CHAR_BIT` may be considered a constant).
DESPITE ALL THIS:
Somehow, for some unknown reason, the C++ standard (as of ISO/IEC 14882:2017) does not require `discard` to be implemented any more efficiently than `z` calls of `operator()()`, for any PRNG engine, including those that definitely allow it.
To me personally this is baffling and just MAKES NO SENSE: it violates (in a very brutal way) one of the fundamental C++ design principles, namely that functionality is added to the C++ standard only if it is reasonable in terms of performance and practical usefulness.
A notable and very well-documented example: there is no such thing as access to a `std::list` element by index (`std::list<T>::operator[](size_type n)`), even though it would be as "easy" as calling `operator++()` `n` times on the iterator `begin()`. And naturally so, because `O(n)` execution time would make this function an unreasonable choice in any practical application (AKA a plain stupid idea). For this obvious reason `a[n]` and `a.at(n)` are not part of the mandatory sequence container requirements (ISO/IEC 14882:2017, 26.2.3, Table 87) but are instead part of the optional sequence container operations (ISO/IEC 14882:2017, 26.2.3, Table 88).
SO why in the world, then, is `e.discard(z)` part of the mandatory random number engine requirements (ISO/IEC 14882:2017, 29.6.1.4, Table 104) with this ridiculous complexity requirement ("no worse than the complexity of z consecutive calls e()") instead of being an entry in some optional operations section with an adequate complexity requirement like `O(P(sizeof(state)) * log(z))`?
Like WHAT the...? `z` consecutive calls of `e()` is exponential in the bit length of `z`; since when is that OK?
Even more baffling was to actually find this real-world implementation in my GCC:

```cpp
void
discard(unsigned long long __z)
{
    for (; __z != 0ULL; --__z)
        (*this)();  // <-- Wait what? Are you kidding me?
}
```
So once again, as before, we have no choice but to implement the necessary functionality ourselves... Not much help from the C++ standard library.
AMENDMENT:
Diving deeper into the design details of the Mersenne Twister, we discover that `discard(z)` of `std::mersenne_twister_engine` can also be implemented quite efficiently. For

```cpp
template<
    class UIntType,
    std::size_t w, std::size_t n, std::size_t m, std::size_t r, UIntType a,
    std::size_t u, UIntType d, std::size_t s, UIntType b, std::size_t t, UIntType c,
    std::size_t l, UIntType f
> class mersenne_twister_engine;
```
even a generic implementation of `discard(z)` (applicable to the whole class of linear PRNGs modulo 2, not just the Mersenne Twister but also WELL and many others) would have complexity like `O(n^ω * log(z))`, where `n` is the template parameter above (the size of the state in `w`-bit words) and the exponent ω is a constant between 2 and 3 (depending on the chosen bit-matrix multiplication algorithm). With a reasonable amount of template-parameter-dependent precomputation this can be trivially reduced to `O(n^ω)` practical execution time. SIMD CPU instructions (vector XOR or vector AND) would improve practical performance by a constant factor. Parallel algorithms (e.g. specialized hardware solutions) could compute this in `O(log(n))` time using `O(n^3)` simultaneous single-bit computations (XORs and ANDs).
Sure, you may notice that the `n` parameter above is typically not that small (624 for `std::mt19937`, 312 for `std::mt19937_64`), and `n` cubed is bigger still, so `O(n^3)` does not necessarily mean fast actual execution. But a modern CPU (and especially a GPU) would still run an optimized implementation quite fast; there is no comparison to the ridiculous exponential complexity of `z` consecutive calls of `operator()()`.
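To illustrate the bit-matrix approach without dragging in the full Mersenne Twister state, here is a minimal sketch using xorshift32 (a tiny F2-linear generator) as a stand-in; the technique is the same for any linear-modulo-2 engine, only the matrix size differs:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One step of xorshift32 -- a small F2-linear generator (every operation is
// a XOR or a shift, hence linear over GF(2)), standing in here for the much
// larger Mersenne Twister state transition.
std::uint32_t xorshift32(std::uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

// A 32x32 bit matrix over GF(2): mat[i] is the image of basis vector e_i.
using Mat = std::array<std::uint32_t, 32>;

// Matrix * vector over GF(2): by linearity, the result is the XOR of the
// images of the set bits of x.
std::uint32_t mul_vec(const Mat& m, std::uint32_t x)
{
    std::uint32_t r = 0;
    for (int i = 0; i < 32; ++i)
        if ((x >> i) & 1)
            r ^= m[i];
    return r;
}

// Matrix * matrix over GF(2): column i of A*B is A applied to B(e_i).
Mat mul(const Mat& a, const Mat& b)
{
    Mat r{};
    for (int i = 0; i < 32; ++i)
        r[i] = mul_vec(a, b[i]);
    return r;
}

// Build the transition matrix by applying one step to each basis vector.
Mat step_matrix()
{
    Mat m{};
    for (int i = 0; i < 32; ++i)
        m[i] = xorshift32(std::uint32_t{1} << i);
    return m;
}

// discard(z) via square-and-multiply on the transition matrix: O(log z)
// matrix multiplications, then a single matrix-vector application.
std::uint32_t jump(std::uint32_t x, unsigned long long z)
{
    Mat acc{};
    for (int i = 0; i < 32; ++i)
        acc[i] = std::uint32_t{1} << i;  // identity matrix
    Mat p = step_matrix();
    while (z != 0) {
        if (z & 1)
            acc = mul(p, acc);
        p = mul(p, p);
        z >>= 1;
    }
    return mul_vec(acc, x);
}
```

For `std::mt19937` the matrix would be 624*32 = 19968 bits on a side, which is exactly where the `O(n^ω)` cost and the SIMD/precomputation remarks above come from.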
SOME GENERAL OBSERVATIONS:
Every existing PRNG (and every one I can imagine) can be defined by the following iterative equations:

```
x[n+1]    = T(x[n])
Output[n] = F(x[n])
```

where `x[n]` is the state (some `W`-bit sequence) after `n` iterations (so `x[0]` is the initial state, AKA the seed), `T(x)` is the iterative state transformation (which converts the current state to the next state), and `F(x)` is the output transformation, which converts each `W`-bit state to a `v`-bit output (`Output[n]`); typically `v < W`.
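As a concrete toy illustration of this `T`/`F` decomposition (constants borrowed from minstd_rand0, `a = 16807`, `m = 2^31 - 1`, purely for the example):

```cpp
#include <cassert>
#include <cstdint>

// Toy engine written exactly in the x[n+1] = T(x[n]), Output[n] = F(x[n])
// form.  For an LCG, F is (essentially) the identity; for Mersenne Twister,
// F would be the tempering step.  Note: per the equations above this outputs
// F(x[n]) and then advances, whereas the standard engines return F(x[n+1]);
// the decomposition is the same either way.
struct TinyLcg {
    std::uint32_t x;  // the W-bit state

    static std::uint32_t T(std::uint32_t s)  // state transition
    {
        return static_cast<std::uint32_t>(16807ull * s % 2147483647ull);
    }
    static std::uint32_t F(std::uint32_t s)  // output transformation
    {
        return s;
    }
    std::uint32_t operator()()
    {
        std::uint32_t out = F(x);  // Output[n]
        x = T(x);                  // advance to x[n+1]
        return out;
    }
};
```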
Sure, both the `T(x)` and `F(x)` computations are fast, i.e. the worst-case time is polynomial, `O(P(W))`. (Typically these functions are designed to be even faster, like `O(P(v))`, which is essentially `O(1)` in most cases, because `v` is usually chosen to be the CPU register size, for which fast hardware-optimized operations are available.)
I mean it literally: all existing (and future) PRNGs that make sense can be expressed this way.
(The only further generalization I can think of is making the sizes `W` and `v` non-constant, i.e. dependent on `n`, changing from one iteration to the next. There is probably not much practical sense in this; I guess no one wants their PRNG's internal data to grow without bound and eventually consume all RAM or some such. Although a very slowly growing `W` could allow non-periodic PRNG designs.)
So the question is: what property of a PRNG makes the `discard(z)` operation fast, i.e. gives it polynomial, `O(P(W))`, worst-case running time?
IMO the quite obvious answer: IF we can quickly compute the iterated function `T^z(x) = T(T(...T(x)...))` (`T` applied `z` times) for any `z`, THEN we can implement a fast `discard(z)`.
It is also not very hard to notice the following. Suppose `T(x) = T_p(x)` is a parametrized transformation with some fixed internal parameter `p`, one of a class of transformations with varying parameter values, such that for any allowed parameter value `q` the transformation `T_q(x)` can be computed fast, in `O(P(W))` time. Suppose further that for any allowed parameter values `p` and `q` the composition `T_p(T_q(x))` is also in this class of transformations, with some allowed parameter `r` that can be computed fast from `p` and `q`. Say we define the notation `r = p*q`, where `*` is some binary operation computable fast (in at most polynomial time), so that `T_{p*q}(x) = T_p(T_q(x))` by definition. (Notice that the binary operation `*` is not necessarily commutative, i.e. `p*q` need not equal `q*p`. But this operation is associative by design, because `T_p(T_q(T_r(x))) = T_p(T_{q*r}(x)) = T_{p*q}(T_r(x))`, hence `p*(q*r) = (p*q)*r`.)
^^^This structure of the `T(x)` transformation obviously allows fast computation of `T^z(x)`: if `T(x) = T_p(x)` and the parameter `p` is known, we just compute `q = p^z = p*p*...*p` (`z` factors), which takes only `O(log(z))` evaluations of `*` (and can be further optimized by precomputation and/or parallel execution), and then we compute `T_q(x)`.
^^^While this many preconditions may seem like a very special case, in fact all of this is quite common. E.g. for the class of linear transformations modulo 2 (like Mersenne Twister, WELL, etc.) the iterative transformation can be represented as multiplication of a constant bit matrix by the state bit vector in modulo-2 arithmetic, so exponentiation of that constant matrix (in modulo-2 bitwise arithmetic) does the trick. With `std::linear_congruential_engine` it is even simpler; do the math yourself as an easy exercise. Elliptic-curve-based PRNGs also meet these conditions. (Actually, I wonder why anyone would design a PRNG without this very useful property. But that's just me.)
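(For reference, a sketch of how that exercise works out: composing two LCG steps stays inside the affine class.)

```latex
T_{(a,c)}(x) = (a x + c) \bmod m
T_{(a_1,c_1)}(T_{(a_2,c_2)}(x)) = (a_1 (a_2 x + c_2) + c_1) \bmod m
                                = (a_1 a_2 x + (a_1 c_2 + c_1)) \bmod m
\Rightarrow (a_1,c_1) * (a_2,c_2) = (a_1 a_2 \bmod m,\ (a_1 c_2 + c_1) \bmod m)
```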