11

I would like to wrap the random number distributions from the C++11 standard library with simple functions that take as arguments the distribution's parameters and a generator instance. For example:

double normal(double mean, double sd, std::mt19937_64& generator)
{
    static std::normal_distribution<double> dist;
    return dist(generator, std::normal_distribution<double>::param_type(mean, sd));
}

I want to avoid any hidden state within the distribution object so that each call to this wrapper function only depends on the given arguments. (Potentially, each call to this function could take a different generator instance.) Ideally, I would make the distribution instance static const to ensure this; however, the distribution's operator() is not a const function, so this isn't possible.

My question is this: To ensure there is no hidden state within the distribution, is it 1) necessary and 2) sufficient to call reset() on the distribution each time? For example:

double normal(double mean, double sd, std::mt19937_64& generator)
{
    static std::normal_distribution<double> dist;
    dist.reset();
    return dist(generator, std::normal_distribution<double>::param_type(mean, sd));
}

(Overall, I'm confused about the purpose of the reset() function for the random distributions... I understand why the generator would need to be reset/reseeded at times, but why would the distribution object need to be reset?)

Praetorian
  • 106,671
  • 19
  • 240
  • 328
Tyler Streeter
  • 1,214
  • 12
  • 14
  • Thanks for the very helpful answers and comments everyone! – Tyler Streeter Feb 13 '13 at 20:25
  • I think I'll change my overall strategy after reading the responses: I'll either bind my generator and distribution (which seems to be the intended usage, rather than allowing multiple generators to use the same distribution object), or I may just use the std lib's generators but ignore the distribution functions (after finding out that the distributions are not necessarily portable: http://stackoverflow.com/questions/14840901/is-the-random-library-in-c11-portable, so, even with the same seed, I may get different sequences of e.g. normal distribution samples on different platforms). – Tyler Streeter Feb 13 '13 at 20:28
  • It seems that there are two use cases here: 1) binding an engine and distribution (the intended use), and 2) using multiple engines with one distribution (what I am doing here). I suppose I prefer to think of the distributions as simple stateless functions: you give them an engine, and they return a sample. But I understand the need to treat them as objects so they can cache values for efficiency. It would have been nice if the standard provided, for each distribution, both a stateless function to get just *one* sample, and an object for generating/caching *several* samples. – Tyler Streeter Feb 14 '13 at 16:58
  • Also, to summarize the answers below, it seems that there are two potential problems with constantly calling `reset()` (or recreating the distribution) before generating each sample: 1) it might decrease efficiency because any cached values within the distribution are lost, and 2) (even more serious, if true) it might produce an incorrect distribution. – Tyler Streeter Feb 14 '13 at 17:04

3 Answers3

8

To ensure there is no hidden state within the distribution, is it 1) necessary

Yes.

and 2) sufficient to call reset() on the distribution each time?

Yes.

You probably don't want to do this though. At least not on every single call. The std::normal_distribution is the poster-child for allowing distributions to maintain state. For example a popular implementation will use the Box-Muller transformation to compute two random numbers at once, but hand you back only one of them, saving the other for the next time you call. Calling reset() prior to the next call would cause the distribution to throw away this already valid result, and cut the efficiency of the algorithm in half.

Howard Hinnant
  • 206,506
  • 52
  • 449
  • 577
  • 1
    Whoa, so what if I have two generator instances, gen1 and gen2, and a std::normal_distribution instance normdist, and I call normdist(gen1) followed by normdist(gen2). Then the second call will depend on gen1 and not gen2? Is that true? – Tyler Streeter Feb 13 '13 at 16:44
  • 2
    @TylerStreeter: It is true for the libc++ implementation. Table 118 which specifies the behavior of calling a distribution with a urng, specifies behavior only for "successive invocations with the **same** object `g`." Successive calls with different urng's is not specified. – Howard Hinnant Feb 13 '13 at 16:59
  • Ok, that's very helpful (and confusing!) Furthermore, for the operator(gen, params) version, the standard specifies behavior only for "successive invocations with the same objects g and p." So if I have two generators (gen1 and gen2) and two sets of normal params (p1 and p2), and I call normdist(gen1, p1) followed by normdist(gen2, p2), the second call *may* depend on gen1 and p1 and not gen2 and p2 at all. Not what I expected! – Tyler Streeter Feb 13 '13 at 18:23
  • I was thinking about this a bit more... Say we have just one generator instance g, a std::bernoulli_distribution b, and two different bernoulli parameters p1 and p2. If I call b(g,p1) followed by b(g,p2), both samples may use p1 and ignore p2, and nothing in the standard prevents that. Again, not what I expected. – Tyler Streeter Feb 13 '13 at 22:44
  • Another example: the sample implementation of std::random_shuffle here http://en.cppreference.com/w/cpp/algorithm/random_shuffle (2nd version) may be invalid because, again, the same distribution is reused with different parameters. – Tyler Streeter Feb 13 '13 at 22:45
  • @TylerStreeter: I think you're probably right, but that this is a defect in the standard rather than an intended restriction. I think the intent is that `d(g,p)` is the same as `d.param(p); d(g)`, or maybe it's supposed to be `auto tmp = d.param(); d.param(p); d(g); d.param(tmp);`. But the semantics of the expression `d(g,p)` don't actually say so. What I don't think it's supposed to do, though, is make future calls with different `p` have an unspecified distribution, because if that's the case then the `d(g,p)` expression is basically useless. – Steve Jessop Feb 14 '13 at 10:15
  • @HowardHinnant: lazy of me, I know, I could look it up. But: what happens in the libc++ implementation of `normal_distribution` if you call it once with one set of params, so it caches a value, and then call it again with different params? Does it do what my comment above says I think is probably the intent of the standard, and return a value scaled according to the new parameters? Or does it do what I say makes `d(g,p)` basically useless, cache the already-scaled value on the first call, and return the cached value ignoring the second set of parameters? – Steve Jessop Feb 14 '13 at 10:26
  • @SteveJessop: Yes, I was implying that this may be a defect in the standard. Regarding the libc++ implementation of `normal_distribution::operator()` (http://llvm.org/svn/llvm-project/libcxx/trunk/include/random), fortunately, it looks like it always uses the supplied parameters (probably the intent of the standard, as you said). However, it will sometimes use the supplied generator and sometimes use a cached value based on the previously supplied generator (possibly *not* the intent of the standard). So `d(g)` is, in your words, basically useless. But `d(g,p)` is only half useless. – Tyler Streeter Feb 14 '13 at 14:09
  • @TylerStreeter: we're using slightly different definitions of "useless", I think. Under the conditions I was describing, `d(g,p)` would be basically useless because you could just as well provide `p` via the constructor or `d.param(p)`. Since there's no way to permanently register a generator with a distribution, `d(g)` would still be useful in the sense that you have to specify `g` somehow. But ISWYM, arguments where there's only one possible input value for which the results are guaranteed to be "right" are pretty annoying whether it's `g` or `p`. – Steve Jessop Feb 14 '13 at 15:31
2

Some distributions have internal state. If you interfere with how the distribution works by constantly resetting it you won't get properly distributed results. This is just like calling srand() before every call to rand().

Pete Becker
  • 74,985
  • 8
  • 76
  • 165
  • 1
    Are you sure it will affect the distribution and not just the efficiency (see @HowardHinnant's answer)? I'm not sure that your srand() analogy is correct (but correct me if I'm wrong). It seems to me that the distribution's internal state should only be used to improve efficiency, but resetting it should not affect the distribution (like resetting the seed would). – Tyler Streeter Feb 13 '13 at 17:06
  • @TylerStreeter: Although only efficiency is impacted in the libc++ implementation, I don't claim that is a portable result. I doubt the designers or implementors of this library ever imagined someone would want to `reset` a distribution prior to every call. – Howard Hinnant Feb 13 '13 at 17:17
  • @HowardHinnant: so do you see a potential problem with this use case as well: http://stackoverflow.com/questions/8433421/should-i-keep-the-random-distribution-object-instance-or-can-i-always-recreate-i where, rather than constantly calling reset, the distribution object is constantly recreated? – Tyler Streeter Feb 13 '13 at 17:55
  • 2
    It might depend on the distribution, but if the first return value out of a newly-created or newly-reset distribution object doesn't have the required distribution, then normally I'd think that means the distribution implementation is flawed. For example, if the first value out of a `normal_distribution` object isn't normally distributed, that's a bug isn't it? Or are distributions allowed to return incorrectly-distributed data to begin with and then make up for it in the long run, using their stored internal state? If so, what does that mean for users of small amounts of RNG data? – Steve Jessop Feb 13 '13 at 18:02
  • 2
    @TylerStreeter: Unless I was sure of the implementation, constantly recreating the distribution object would make me nervous. The libc++ implementation of `uniform_int_distribution` is trivial. However not all libc++ distributions have trivial constructors. Though for the libc++ implementation, I still think this is only a matter of efficiency. I don't have an example distribution/implementation in mind where a correctness problem might arise. However I know just enough about this domain to know that one should not abuse the API if one wants to get high quality output. – Howard Hinnant Feb 13 '13 at 18:09
  • 2
    @SteveJessop - a single value does not have a distribution; a collection of values has a distribution, and there is no requirement that you be able to call `reset` repeatedly without disrupting what the distribution object is doing. – Pete Becker Feb 13 '13 at 19:47
  • @HowardHinnant - yes, it's probably typically a matter of efficiency, but there is no requirement that repeatedly calling `reset` won't disrupt things. – Pete Becker Feb 13 '13 at 19:48
  • 1
    @PeteBecker: There isn't really space in comments for the required mathematical formalism, but "the first value out of a distribution object" is a random variable (assuming the generator backing it is random, of course), and as such it does have distribution. It is true that one particular value so obtained does not have a distribution, but that does not imply (as you seem to suggest) that it doesn't matter how that first value is generated. The issue is not whether calling `reset` disrupts/affects the distribution object (of course it does), the issue is how the resulting data is distributed. – Steve Jessop Feb 14 '13 at 09:14
  • 1
    ... so for example, it seems to me that the standard guarantees that `uniform_int_distribution(0,1)(my_generator)` has a 50-50 chance of returning 0. This is subject to the quality of `my_generator`, but not subject to any caveats about the implementation of `uniform_int_distribution` being permitted by the standard to produce poor-quality data shortly after being created. The same applies to a sequence of values each of which was obtained immediately after calling `reset`, and to any other distribution object whose probability distribution is defined in the standard. – Steve Jessop Feb 14 '13 at 09:25
  • @SteveJessop - there is no requirement that interfering with a distribution by repeatedly calling `reset` not disrupt the results. You can argue that this is unlikely, and I'd agree, but that doesn't turn it into a requirement, and it does not mean that calling `reset` cannot have consequences. – Pete Becker Feb 14 '13 at 13:04
  • @SteveJessop - you're too quick. That comment was only there for about ten seconds before I removed it! – Pete Becker Feb 14 '13 at 13:10
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/24517/discussion-between-steve-jessop-and-pete-becker) – Steve Jessop Feb 14 '13 at 13:12
  • @SteveJessop - my apologies; I lost my internet connection for several hours. – Pete Becker Feb 14 '13 at 16:48
1

Calling reset() on a distribution object d has the following effect:

Subsequent uses of d do not depend on values produced by any engine prior to invoking reset.

(an engine is in short a generator that can be seeded).

In other words, it clears any "cached" random data that the distribution object has stored and that depends on output that it has previously drawn from an engine.

So, if you want to do that then you should call reset(). The main reason I can think of that you would want to do that is when you are seeding your engine with a known value with the intention of producing repeatable pseudo-random results. If you want the results from your distribution object to also be repeatable based on that seed, then you need to reset the distribution object (or create a new one).

Another reason I can think of is that you are defensively reseeding the generator object because you fear that some attacker may gain partial knowledge of its internal state (as for example Fortuna does). To over-simplify, you can imagine that the quality/security of the generator's data diminishes over time, and that reseeding restores it. Since a distribution object can cache arbitrary amounts of data from the generator, there will be an arbitrary delay between increasing the quality/security of the output of the generator, and increasing the quality/security of the output of the distribution object. Calling reset on the distribution object avoids this delay. But I won't swear to this latter use being correct, because it gets into the realms where I prefer not to make my own judgement about what is secure, if I can possibly rely on peer-reviewed work by an expert :-)

With regard to your code in particular -- if you don't want the output to depend on previous use of the same dist object with different generator objects, then calling reset() would be the way to do that. But I think it's unlikely that calling reset on a distribution object and then using it with new parameters will be any cheaper than constructing a new distribution object with those parameters. So using a static local object seems to me to make your function non-thread-safe for no benefit: you could create a new distribution object each time and the code would likely be no worse. There are reasons for the design in the standard, and you're expected to use a distribution object repeatedly with the same generator. The function you've written, cutting the distribution object out of the interface, discards the benefits of that part of the design in the standard.

Steve Jessop
  • 273,490
  • 39
  • 460
  • 699
  • The primary purpose of `reset` is to get things back to a known state so that you can reproduce a prior sequence or resume where you left off. – Pete Becker Feb 14 '13 at 13:06