For a Monte Carlo integration process, I need to pull a lot of random samples from a histogram that has N buckets, where N is arbitrary (i.e. not a power of two) but does not change at all during the course of the computation.
By a lot, I mean something on the order of 10^10 (10 billion), so pretty much any kind of lengthy precomputation is likely worth it in the face of the sheer number of samples.
I have at my disposal a very fast uniform pseudo-random number generator that typically produces unsigned 64-bit integers (all the integers in the discussion below are unsigned).
The naive way to pull a sample: histogram[prng() % histogram.size()]
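For context, a minimal sketch of the naive sampler (assuming prng() is the 64-bit generator mentioned above; the bucket payload type is just for illustration):

    #include <cstdint>
    #include <vector>

    uint64_t prng(); // the fast 64-bit PRNG

    // Naive sampler: the % operator compiles to one IDIV per sample.
    double sample_naive(const std::vector<double>& histogram) {
        return histogram[prng() % histogram.size()];
    }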
The naive way is very slow: the modulo operation uses an integer division (IDIV), which is terribly expensive, and the compiler, not knowing the value of histogram.size() at compile time, can't work its usual magic (i.e. http://www.azillionmonkeys.com/qed/adiv.html). As a matter of fact, the bulk of my computation time is spent extracting that darn modulo.
The slightly less naive way: I use libdivide (http://libdivide.com/), which is capable of pulling off a very fast "divide by a constant not known at compile time".
That gives me a very nice win (25% or so), but I have a nagging feeling that I can do better. Here's why:
First intuition: libdivide computes a division, but what I need is a modulo, and to get there I have to do an additional multiply and a subtract:

    mod = dividend - divisor * (uint64_t)(dividend / divisor)

I suspect there might be a small win there, using libdivide-type techniques that produce the modulo directly.
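For illustration, here is a minimal sketch of that div-mult-sub modulo built on libdivide's C++ interface (libdivide::divider is libdivide's type; the wrapper struct is just for exposition):

    #include <cstdint>
    #include "libdivide.h"

    uint64_t prng(); // the fast 64-bit PRNG

    // Precompute the magic numbers once, since N never changes.
    struct FastMod {
        uint64_t N;
        libdivide::divider<uint64_t> d;
        explicit FastMod(uint64_t n) : N(n), d(n) {}
        uint64_t mod(uint64_t x) const {
            return x - N * (x / d); // libdivide division, then mult and sub
        }
    };

    // Usage: FastMod fm(histogram.size()); uint64_t idx = fm.mod(prng());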
Second intuition: I am actually not interested in the modulo itself. What I truly want is to efficiently produce a uniformly distributed integer value that is guaranteed to be strictly smaller than N.
The modulo is a fairly standard way of getting there, because of two of its properties:
A) mod(prng(), N) is guaranteed to be uniformly distributed if prng() is

B) mod(prng(), N) is guaranteed to belong to [0, N[
But the modulo is/does much more than just satisfy the two constraints above, and in fact it probably does too much work.

All I need is a function, any function, that obeys constraints A) and B) and is fast.

So, long intro, but here come my two questions:
Is there something out there equivalent to libdivide that computes an integer modulo directly?
Is there some function F(X, N) of integers X and N which obeys the following two constraints:

- If X is a random variable uniformly distributed, then F(X, N) is also uniformly distributed
- F(X, N) is guaranteed to be in [0, N[
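For concreteness, one candidate shape for such an F is the fixed-point multiply-and-shift trick (sketched below; it satisfies B) exactly, but A) only to the same approximation as the modulo does, at the cost of one multiply and one shift instead of a division):

    #include <cstdint>

    // Map a uniform 64-bit X onto [0, N) by keeping the high 64 bits of X * N.
    // __uint128_t is a GCC/Clang extension; MSVC would need _umul128 instead.
    uint64_t mul_shift(uint64_t x, uint64_t N) {
        return (uint64_t)(((__uint128_t)x * N) >> 64);
    }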
(PS: I know that if N is small, I do not need to consume all 64 bits coming out of the PRNG. As a matter of fact, I already do that. But like I said, even that optimization is a minor win compared to the big fat loss of having to compute a modulo.)
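(Roughly the kind of thing I mean by consuming fewer bits, sketched under the assumption that N fits in 16 bits so one 64-bit draw serves four samples; my actual code differs in the details:)

    #include <cstdint>

    uint64_t prng();

    // Slice one 64-bit PRNG output into four 16-bit chunks (assumes N <= 65536).
    // Each chunk still pays a modulo, which is why the win stays minor.
    uint32_t next_small(uint32_t N) {
        static uint64_t word;
        static int chunks = 0;
        if (chunks == 0) { word = prng(); chunks = 4; }
        uint32_t x = (uint32_t)(word & 0xFFFF);
        word >>= 16;
        --chunks;
        return x % N;
    }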
Edit: prng() % N is indeed not exactly uniformly distributed. But for N large enough, I don't think it's much of a problem (or is it?)
Edit 2: prng() % N is indeed potentially very badly distributed. I had never realized how bad it could get. Ouch. I found a good article on this: http://ericlippert.com/2013/12/16/how-much-bias-is-introduced-by-the-remainder-technique
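(For the record, the standard bias-free fix is rejection: throw away the 2^64 mod N lowest raw values so that every residue occurs equally often. A minimal sketch, assuming prng() is uniform over all 2^64 values; of course this still pays for the modulo, so it doesn't solve the speed problem:)

    #include <cstdint>

    uint64_t prng();

    // Unbiased modulo via rejection: residues below 2^64 mod N occur one
    // extra time across [0, 2^64), so reject raw draws below that threshold.
    uint64_t unbiased_mod(uint64_t N) {
        const uint64_t threshold = (0 - N) % N; // == 2^64 mod N by wraparound
        uint64_t x;
        do { x = prng(); } while (x < threshold);
        return x % N;
    }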