3

For a Monte Carlo integration process, I need to pull a lot of random samples from a histogram that has N buckets, where N is arbitrary (i.e. not a power of two) but doesn't change at all during the course of the computation.

By a lot, I mean something on the order of 10^10, i.e. 10 billion (so pretty much any kind of lengthy precomputation is likely worth it in the face of the sheer number of samples).

I have at my disposal a very fast uniform pseudo random number generator that typically produces unsigned 64-bit integers (all the ints in the discussion below are unsigned).

The naive way to pull a sample: histogram[ prng() % histogram.size() ]

The naive way is very slow: the modulo operation uses an integer division (IDIV), which is terribly expensive, and the compiler, not knowing the value of histogram.size() at compile time, can't work its usual magic (i.e. http://www.azillionmonkeys.com/qed/adiv.html).

As a matter of fact, the bulk of my computation time is spent extracting that darn modulo.

The slightly less naive way: I use libdivide (http://libdivide.com/) which is capable of pulling off a very fast "divide by a constant not known at compile time".
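
Roughly, the sampling then looks like this (a sketch using libdivide's C++ divider<> wrapper, details simplified):

#include <cstdint>
#include "libdivide.h"

// Precomputed once, since histogram.size() never changes:
//   const uint64_t n = histogram.size();
//   const libdivide::divider<uint64_t> fast_n(n);

uint64_t draw_index(uint64_t x, uint64_t n, const libdivide::divider<uint64_t>& fast_n)
{
    uint64_t q = x / fast_n;   // fast division by a runtime constant
    return x - q * n;          // the remainder, i.e. x % n (the extra mul + sub)
}

// usage: histogram[ draw_index(prng(), n, fast_n) ]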

That gives me a very nice win (25% or so), but I have a nagging feeling that I can do better, here's why:

  • First intuition: libdivide computes a division. What I need is a modulo, and to get there I have to do an additional mult and a sub : mod = dividend - divisor*(uint64_t)(dividend/divisor). I suspect there might be a small win there, using libdivide-type techniques that produce the modulo directly.

  • Second intuition: I am actually not interested in the modulo itself. What I truly want is to efficiently produce a uniformly distributed integer value that is guaranteed to be strictly smaller than N.

The modulo is a fairly standard way of getting there, because of two of its properties:

  • A) mod(prng(), N) is guaranteed to be uniformly distributed if prng() is

  • B) mod(prng(), N) is guaranteed to belong to [0,N[

But the modulo is/does much more than just satisfy the two constraints above, and in fact it probably does too much work.

All I need is a function, any function that obeys constraints A) and B) and is fast.

So, long intro, but here come my two questions:

  • Is there something out there equivalent to libdivide that computes integer modulos directly?

  • Is there some function F(X, N) of integers X and N which obeys the following two constraints:

    • If X is a random variable uniformly distributed then F(X,N) is also uniformly distributed
    • F(X, N) is guaranteed to be in [0, N[

(PS: I know that if N is small, I do not need to consume all the 64 bits coming out of the PRNG. As a matter of fact, I already do that. But like I said, even that optimization is a minor win when compared to the big fat loss of having to compute a modulo.)

Edit: prng() % N is indeed not exactly uniformly distributed. But for N large enough, I don't think it's much of a problem (or is it?)

Edit 2: prng() % N is indeed potentially very badly distributed. I had never realized how bad it could get. Ouch. I found a good article on this: http://ericlippert.com/2013/12/16/how-much-bias-is-introduced-by-the-remainder-technique

blondiepassesby
  • 181
  • 1
  • 7
  • 1
    (1) - On most platforms, the remainder is calculated "for free" as the result of a division at the hardware level. (2) - Neither modulo nor divide gives you an unbiased result. – Oliver Charlesworth Aug 30 '14 at 15:40
  • 3
    "A) mod(prng(), N) is guaranteed to be uniformly distributed if prng() is" is only true if `N` evenly divides `M`, where `prng()` returns numbers uniformly in `[0, M[`. – huon Aug 30 '14 at 15:42
  • @OliCharlesworth : (1) that might be the case, but it's still slow. (2) True, and I'll edit my question, but the error between the perfect uniform distribution and prng()%N is not something I'm particularly worried about because N is large. – blondiepassesby Aug 30 '14 at 15:45
  • 1
    Have you tried `std::uniform_int_distribution`? – Jarod42 Aug 30 '14 at 15:50
  • What is the range of `N` ? – Jarod42 Aug 30 '14 at 15:54
  • With some mathematics, I think you may feed each of your buckets with a non-uniform distribution (so complexity would be linear in `N` and not in the number of samples). – Jarod42 Aug 30 '14 at 16:08
  • 1
    "Edit : prng() % N is indeed not exactly uniformly distributed. But for N large enough, I don't think it's much of problem (or is it ?)" It actually gets worse as `N` gets larger (i.e. small N is better), e.g. if `prng()` generates 0, 1, 2, ..., 15 uniformly then `prng() % 4` has a small bias away from `3` (`3` occurs 20% of the time, 0, 1, 2 occur 26%), but `prng() % 10` has a large bias towards 0, 1, ..., 5 (they occur 2 times more often than 6, 7, 8, 9). – huon Aug 30 '14 at 16:18
  • 1
    As @Jarod42 says, what is typical `N`? Assuming you're on a "modern" x86, N > (256kB/8B) = 32k will lead to spilling the L2 cache, which must surely become the dominant performance effect. – Oliver Charlesworth Aug 30 '14 at 16:30
  • 2
    You could try `histogram[ (int)(prng() * (HISTOGRAM_SIZE / (PRNG_MAX + 1.0))) ]`. Precompute the constant once per histogram. This compiles to one floating point multiply and an integer conversion. Use MMX or GPU implementation to do more than one at a time. But I agree with @OliCharlesworth that if you're blowing out the cache with random histogram access, this is a huge cost. – Gene Aug 30 '14 at 16:38
  • @Jarod42 : Looking at the implementation of std::uniform_int_distribution, I have little hope that it'll be efficient. And indeed, when I try it ... twice slower than naive modulo. – blondiepassesby Aug 30 '14 at 23:27
  • @Gene I just tried this, and it is - to my surprise - faster than the modulo. From what I read around the net, it also seems to be better distributed than the modulo as long as PRNG_MAX is large enough. I had discarded this idea as being obviously slower than the modulo. Just goes to show how bad preconceptions can be. – blondiepassesby Aug 31 '14 at 17:34
  • @blondiepassesby A lot of old preconceptions about what's slow and fast are blown away by modern processor architectures. Floating point multiply can be faster than integer division, and results can also be strongly affected by how busy the various arithmetic units are. If you have no other floating point computations in your program... You might get an additional speedup by forcing the multiplication to be `float` rather than `double`, but beware of the very limited precision of `float`. There are only 23 bits available: 8 million or so. – Gene Aug 31 '14 at 20:33

9 Answers

3

Under the circumstances, the simplest approach may work the best. One extremely simple approach that might work out, if your PRNG is fast enough, is to pre-compute a mask: one less than the next power of 2 larger than your N. I.e., given an N that looks like 0001xxxxxxxx in binary (where x means we don't care whether it's a 1 or a 0), we want a mask like 000111111111.

From there, we generate numbers as follows:

  1. Generate a number
  2. AND it with your mask
  3. If the result >= N (i.e. it falls outside [0, N[), go to 1

The exact effectiveness of this will depend on how close N is to a power of 2. Each successive power of 2 is (obviously enough) double its predecessor. So, in the best case N is exactly one less than a power of 2, and the test in step 3 virtually never rejects anything. We've added only a mask and a comparison to the time taken for the PRNG itself.

In the worst case, N is exactly equal to a power of 2. In this case, we expect to throw away roughly half the numbers we generated.

On average, N ends up roughly halfway between powers of 2. That means, on average, we throw away about one out of four inputs. We can nearly ignore the mask and comparison themselves, so our speed loss compared to the "raw" generator is basically equal to the number of its outputs that we discard, or 25% on average.
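
A minimal sketch of that loop, assuming the question's 64-bit prng(); the mask is one less than the next power of two above N, as described above:

#include <cstdint>

uint64_t prng();                       // the question's fast 64-bit PRNG

// Precompute once: one less than the next power of 2 larger than n.
uint64_t make_mask(uint64_t n) {
    uint64_t p = 1;
    while (p <= n) p <<= 1;
    return p - 1;
}

// Per sample: mask, then reject anything outside [0, n).
uint64_t draw_index(uint64_t n, uint64_t mask) {
    uint64_t x;
    do {
        x = prng() & mask;
    } while (x >= n);
    return x;
}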

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • 1
    I've implemented the rejection method you describe. It works reasonably well and does beat modulo both in terms of quality (uniformity) and speed. But for most N's, it's slower than the "normalization via a double" method, even with branch prediction hints. – blondiepassesby Aug 31 '14 at 23:48
2

If you have fast access to the needed instruction, you could 64-bit multiply prng() by N and return the high 64 bits of the 128-bit result. This is sort of like multiplying a uniform real in [0, 1) by N and truncating, with bias on the order of the modulo version (i.e., practically negligible; a 32-bit version of this answer would have small but perhaps noticeable bias).

Another possibility to explore would be use word parallelism on a branchless modulo algorithm operating on single bits, to get random numbers in batches.
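
A sketch of the multiply-high mapping, assuming a compiler with the __uint128_t extension (GCC/Clang); other compilers expose the same operation through an intrinsic such as MSVC's __umulh:

#include <cstdint>

// Map a uniform 64-bit x to [0, n) by keeping the top 64 bits of the 128-bit product.
// This is floor(x * n / 2^64); the bias is on the order of n / 2^64.
uint64_t multiply_high(uint64_t x, uint64_t n) {
    return (uint64_t)(((__uint128_t)x * n) >> 64);
}

// usage: histogram[ multiply_high(prng(), histogram.size()) ]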

David Eisenstat
  • 64,237
  • 7
  • 60
  • 120
  • I implemented this. It works well and it is faster than the floating point method (20.34 secs for FP vs. 18.66 secs for your method, so about 9% faster). Nice. – blondiepassesby Sep 06 '14 at 20:22
1

Libdivide, or any other complex way to optimize that modulo, is simply overkill. In a situation like yours, the only sensible approach is to

  1. ensure that your table size is a power of two (add padding if you must!)

  2. replace the modulo operation with a bitmask operation. Like this:

    size_t tableSize = 1 << 16;
    size_t tableMask = tableSize - 1;
    
    ...
    
    histogram[prng() & tableMask]
    

A bitmask operation is a single cycle on any CPU worth its money; you can't beat its speed.

--

Note:
I don't know about the quality of your random number generator, but it may not be a good idea to use its low-order bits. Some RNGs produce poor randomness in the low bits and better randomness in the upper bits. If that is the case with your RNG, use a bitshift to keep the most significant bits:

size_t bitCount = 16;

...

histogram[prng() >> (64 - bitCount)]

This is just as fast as the bitmask, but it uses different bits.

cmaster - reinstate monica
  • 38,891
  • 9
  • 62
  • 106
  • What if you're not after a power-of-2? – Oliver Charlesworth Aug 30 '14 at 15:48
  • @cmaster : A 25% gain is not what I'd call overkill. As far as I'm concerned, it's pretty darn good. As for rounding up ... meh. My histogram is large, so the next power of two is "far". Also, the histogram is made of observed values from the real world, so your padding idea is - to say the least - not trivial: what would you "pad" with exactly ? – blondiepassesby Aug 30 '14 at 15:50
  • @OliCharlesworth I explicitly mentioned that: if your use case constrains you to a non-power-of-2 table size, pad it so that it becomes a power of two. Throw away all random numbers that would go into the padding, if you must, but make the physical table a power of two to avoid the division. – cmaster - reinstate monica Aug 30 '14 at 15:55
  • The padding can be anything; you could even compare the resulting index with the correct maximum and discard it if it's not within range, which would save you the space overhead. However, `if` conditions are performance-critical as well, so try to avoid them. In most cases it is possible to get away with simply doing some bogus calculations on the padding part, and it does not pay off to avoid that. Of course, you have overhead due to the padding, but that overhead is less than the amount of useful work, i.e. a sensible tradeoff in most cases. – cmaster - reinstate monica Aug 30 '14 at 16:00
1

You could extend your histogram to a "large" power of two by cycling it, filling in the trailing spaces with some dummy value (guaranteed to never occur in the real data). E.g. given a histogram

[10, 5, 6]

extend it to length 16 like so (assuming -1 is an appropriate sentinel):

[10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, -1]

Then sampling can be done via a binary mask histogram[prng() & mask] where mask = new_length - 1 (with new_length a power of two), with a check for the sentinel value to retry, that is,

int value;
do {
    value = histogram[prng() & mask];
} while (value == SENTINEL);

// use `value` here

The extension is longer than necessary to make retries unlikely by ensuring that the vast majority of the elements are valid (e.g. in the example above only 1/16 lookups will "fail", and this rate can be reduced further by extending it to e.g. 64). You could even use a "branch prediction" hint (e.g. __builtin_expect in GCC) on the check so that the compiler orders code to be optimal for the case when value != SENTINEL, which is hopefully the common case.
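
For example, the retry loop above could be annotated like this (a sketch; __builtin_expect is a GCC/Clang extension):

int value;
do {
    value = histogram[prng() & mask];
} while (__builtin_expect(value == SENTINEL, 0));  // tell the compiler the retry is the rare case

// use `value` here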

This is very much a memory vs. speed trade-off.

huon
  • 94,605
  • 21
  • 231
  • 225
  • Branch prediction hints aren't supported by modern x86 implementations. Oh, nevermind, you're talking about compiler hints. – Oliver Charlesworth Aug 30 '14 at 15:52
  • Nevertheless, it's not clear to me (simply because I don't have the evidence at hand) whether the branch-prediction miss rate outweighs the cost of using modulo. – Oliver Charlesworth Aug 30 '14 at 15:54
  • @OliCharlesworth, I mean the `__builtin_expect` compiler intrinsics, purely so the compiler can reorder code (as my answer states); it just took me a little while to dig up the link. – huon Aug 30 '14 at 15:54
  • @dbaupp : this is the same idea cmaster mentioned (padding). But unfortunately, I don't know how to "pad" a 4Meg histogram without distorting the statistical properties of my original histogram when N is not a power of two. And the rejection ... nice idea, but likely slow cause data size gets fat. I'll try it though, it's interesting. Edit - I really like this idea for small histogram sizes. – blondiepassesby Aug 30 '14 at 15:56
  • @blondiepassesby I'm sure that a negative number is unlikely to occur in a real world histogram. – huon Aug 30 '14 at 15:58
  • @blondiepassesby: What do you mean by "data size gets fat"? – Oliver Charlesworth Aug 30 '14 at 16:02
  • @dbaupp : the sentinel is not the issue. I'm using 64bit ints, and an "all 1" value is perfect for that. It's doubling the size of the histogram data that's not working out. – blondiepassesby Aug 30 '14 at 16:02
  • @OliCharlesworth : I mean that histogram won't fit in L2 if I double or quadruple its size artificially. – blondiepassesby Aug 30 '14 at 16:06
  • @blondiepassesby: Ok, makes sense. What processor are you using with a 4MB+ L2 cache, though? – Oliver Charlesworth Aug 30 '14 at 16:08
  • @blondiepassesby, yes I imagine cache effects may be bad here; it's definitely worth measuring before a scheme like this. (Also, do you really need 64-bit ints? You could halve your data size using 32-bit ones, or if you're really lucky, 16-bit ones could work.) – huon Aug 30 '14 at 16:20
1

Just a few ideas to complement the other good answers:

  1. What percent of time is spent in the modulo operation, and how do you know what that percent is? I only ask because sometimes people say something is terribly slow when in fact it is less than 10% of the time and they only think it's big because they're using a silly self-time-only profiler. (I have a hard time envisioning a modulo operation taking a lot of time compared to a random number generator.)

  2. When does the number of buckets become known? If it doesn't change too frequently, you can write a program-generator. When the number of buckets changes, automatically print out a new program, compile, link, and use it for your massive execution. That way, the compiler will know the number of buckets (see the sketch after this list).

  3. Have you considered using a quasi-random number generator, as opposed to a pseudo-random generator? It can give you higher precision of integration in much fewer samples.

  4. Could the number of buckets be reduced without hurting the accuracy of the integration too much?
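
For point 2, a sketch of what the generated translation unit might contain (the file name and value are hypothetical; the point is only that the bucket count becomes a compile-time constant):

// generated_buckets.h -- regenerated, recompiled and relinked whenever the bucket count changes
#include <cstdint>

constexpr uint64_t kNumBuckets = 123457;   // hypothetical value, now known to the compiler

inline uint64_t draw_index(uint64_t x) {
    // With kNumBuckets a compile-time constant, the compiler strength-reduces this
    // modulo into multiplies and shifts, as in the adiv.html link from the question.
    return x % kNumBuckets;
}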

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
  • 1. 25% of the *overall* time of the Monte-Carlo integration is spent taking that modulo. Verified with both perf and gprof. Another 25% is spent accessing the histogram when it doesn't fit inside L2. The random number generator I use is an AVX optimized version of the Mersenne Twister that generates random numbers in large batches. It only accounts for 7% of the overall computation time. The modulo is the killer. – blondiepassesby Aug 30 '14 at 23:05
  • 2. The number of buckets changes very infrequently, and the "generate a program" idea is not a bad one, but gcc doesn't do much better than libdivide in that specific case. – blondiepassesby Aug 30 '14 at 23:09
  • 3. This is an interesting idea, and I am going to experiment with this further (assuming it is faster than AVX-MT and that I don't have to take a modulo of the output of the QRNG to poke at my histogram). – blondiepassesby Aug 30 '14 at 23:10
  • 4. Short answer: I don't know. Long answer: the histogram data comes from the real world and I don't have enough stats background to know how much of it I can afford drop before my computation gets severely affected. – blondiepassesby Aug 30 '14 at 23:11
  • @blondiepassesby: 1. I'm a bit of a nut telling people [*all the faults with gprof*](http://archive.today/9r927) like it doesn't give you line or instruction level resolution. (I'd be surprised if *perf* is much better.) I would use the diagnostic method recommended in that post before I put any faith in the percents. At any rate, I would never assume there's only one thing to be fixed. If you can pick up a 30% speedup with modulo, that's great, but there could well be more. [*This explains why*.](http://stackoverflow.com/a/25432302/23771) i.e. what's the other 68% of time? – Mike Dunlavey Aug 31 '14 at 00:27
  • @blondiepassesby: I'm not sure how you calculated those percentages, but I'll assume they're correct. Bear in mind that *some* of the computation time will be overlapped with the L3 cache access; i.e. you're effectively hiding L3 latency. If you reduce the computation (even to 0%), you'll expose more L3 latency. – Oliver Charlesworth Aug 31 '14 at 08:50
  • perf is way better than gprof. It's in fact even better than Shark. It certainly does give you instruction-level resolution, and can deal properly with pipelining given the right flags (i.e. you don't see that movq pay the actual cost of the idiv 3-5 instructions above it: attribution is more or less done correctly). If you're passionate about benchmarking, I'm kind of surprised you haven't looked at perf before. – blondiepassesby Aug 31 '14 at 08:56
  • @MikeDunlavey regarding benchmarking methodologies, if I understand correctly the point you are trying to make in your other threads, it is that there is much gain to be had in looking at the tail of the time distribution rather than always focusing on the head. Do I understand correctly? If so, in my case the tail is essentially I/O and random number generation. The former is just one big fat read, and the latter has already been given a *lot* of attention :) – blondiepassesby Aug 31 '14 at 09:40
  • @blondiepassesby: well, it's that plus one's relying on the tool to recognize speedup opportunities, and in general that requires intelligence that the tools don't have. If you examine several stack samples, you get a gut feeling for the full reason why it's spending the time, where the tool can only give you measurements on subsets of the code. (And of course, there's the old saw about needing lots of samples. To see something big enough to worry about, a small number of samples is quite sufficient :) – Mike Dunlavey Aug 31 '14 at 13:42
  • @MikeDunlavey : oh, you mean the good old ^C in gdb benchmarking technique. No offense meant, but that is not exactly new :) I think I was first exposed to it 30 years ago. Also : take a look at perf, it's a really nice tool. Here's a good prezo on it: http://events.linuxfoundation.org/sites/events/files/slides/ELCE%20-%20fighting%20latency.pdf – blondiepassesby Aug 31 '14 at 19:57
  • @blondiepassesby: It certainly isn't new. What's new is the appreciation of how *effective* it is, provided you sample the whole stack, not just the PC, and then *manually examine* the samples, not just summarize them. Also they need to be taken on wall-clock time so as not to be blind to IO. Those slides on *perf* are nice, but say nothing about the stack. I've had academics say to me that any and all of the problems I've found could be found by a profiler, but that's common wisdom, wishful thinking. It's easy for big problems to hide from any profiler. Anyway, thanks for the conversation. – Mike Dunlavey Sep 01 '14 at 00:58
1

The non-uniformity dbaupp cautions about can be side-stepped by rejecting & redrawing values no less than M*floor(2^64/M) (before taking the modulus).
If M can be represented in no more than 32 bits, you can get more than one value less than M per draw by repeated multiplication (see David Eisenstat's answer) or divmod; alternatively, you can use bit operations to single out bit fields just wide enough for M, again rejecting values no less than M.
(I'd be surprised at modulus not being dwarfed in time/cycle/energy consumption by random number generation.)
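
A sketch of that rejection threshold, assuming the question's 64-bit prng() and writing m for the histogram size (using the largest representable multiple of m as the cutoff):

#include <cstdint>

uint64_t prng();                                    // the question's fast 64-bit PRNG

// Reject the top, partial block of PRNG outputs so that the accepted range is an
// exact multiple of m; the modulus of what remains is then unbiased.
uint64_t draw_unbiased(uint64_t m) {
    const uint64_t limit = m * (UINT64_MAX / m);    // largest multiple of m that fits in 64 bits
    uint64_t x;
    do {
        x = prng();
    } while (x >= limit);
    return x % m;
}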

greybeard
  • 2,249
  • 8
  • 30
  • 66
  • 1
    + Especially for your last sentence. – Mike Dunlavey Aug 30 '14 at 16:55
  • 1
    @greybeard : modulo is more expensive than RNG. Mersenne Twister when batched and vectorized costs close to nil. It can typically crank out 32 bits in under 3 cycles. – blondiepassesby Aug 31 '14 at 00:00
  • @blondiepassesby: for lack of ability to comment above: what use is taking samples from a histogram over and again? I see a value in taking samples to fill a histogram. – greybeard Aug 31 '14 at 07:11
  • @greybeard : the histogram is built from observed values from a real world (physical) process. In a sense, the histogram is a "model" of the real world process. I am trying to incorporate this physical process in a simulation of a larger system for which I am trying to compute the end state via Monte Carlo integration. To do that, I need to draw many random samples from the histogram. – blondiepassesby Aug 31 '14 at 08:48
0

Instead of pulling one sample at a time, you may use std::binomial_distribution to compute directly how many of the samples land in each bucket.

The following may help:

#include <cstddef>
#include <ctime>
#include <random>

int main() {
    int nrolls = 60;            // total number of samples to distribute
    const std::size_t N = 6;    // number of buckets
    unsigned int bucket[N] = {};

    std::mt19937 generator(time(nullptr));

    for (std::size_t i = 0; i != N; ++i) {
        // Probability that one of the remaining samples lands in bucket i,
        // given that it did not land in buckets 0..i-1 (uniform buckets assumed).
        double proba = 1. / static_cast<double>(N - i);
        std::binomial_distribution<int> distribution(nrolls, proba);
        bucket[i] = distribution(generator);
        nrolls -= bucket[i];    // the rest goes to the remaining buckets
    }
}

Live example

Jarod42
  • 203,559
  • 14
  • 181
  • 302
  • 1
    I'm not sure I understand how your idea works. Can you explain what "feeding the bucket" means and how the binomial distribution comes into play ? What does bucket[i] represent ? I must confess: I am a bit confused by your code. – blondiepassesby Aug 30 '14 at 22:58
  • I misread, and used "feed" instead of "pull", sorry. As I understand it, you loop to select which histogram bucket to use each time; I propose to count how many times you hit a given bucket another way. Instead of rolling a die 60 times and looking at the distribution, I compute, for each face, how many times it has been rolled via a binomial distribution. – Jarod42 Aug 31 '14 at 14:11
0

Instead of integer division you can use fixed-point math, i.e. integer multiplication & bitshift. Say your prng() returns values in the range 0-65535 and you want this quantized to the range 0-99; then you do (prng()*100)>>16. Just make sure that the multiplication doesn't overflow your integer type, so you may have to shift the result of prng() right. Note that this mapping is better than modulo since it retains the uniform distribution.
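
With the concrete numbers above, the mapping would look like this (prng16() is a hypothetical generator returning uniform values in 0-65535):

#include <cstdint>

uint16_t prng16();                                    // hypothetical: uniform over 0-65535
uint32_t index = ((uint32_t)prng16() * 100u) >> 16;   // approximately uniform in [0, 100), no division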

JarkkoL
  • 1,898
  • 11
  • 17
0

Thanks everyone for your suggestions.

First, I am now thoroughly convinced that modulo is really evil: it is both very slow and yields a biased distribution in most cases.

After implementing and testing quite a few of the suggestions, what
seems to be the best speed/quality compromise is the solution proposed
by @Gene:

  1. pre-compute normalizer as:

    auto normalizer = histogram.size() / (1.0+urng.max());

  2. draw samples with:

    return histogram[ (uint32_t)floor(urng() * normalizer) ];

It is the fastest of all methods I've tried so far and, as far as I can tell, it yields a distribution that's much better than the plain modulo, even if it may not be as perfect as the rejection method.

Edit: I implemented David Eisenstat's method, which is more or less the same as JarkkoL's suggestion: index = (rng() * N) >> 32. It works as well as the floating point normalization and it is a little faster (9% faster, in fact). So it is my preferred way now.

blondiepassesby
  • 181
  • 1
  • 7