Suppose I have large buffers of uniformly random bytes (an entropy source). I want to use them to draw many samples (e.g., 10^7 at a time) from a fixed (rational) probability distribution over a finite set (e.g., 8 elements).
I need

- a theoretical guarantee that the specified distribution is reproduced exactly;
- reasonable efficiency with the random bits: e.g., if the Shannon entropy $H$ of my distribution (over 8 symbols) is around 2.3, I would like to use at most 3 bits from my stream per sample on average, and ideally to get within 20% of the Shannon limit (see the short calculation after this list);
- fast sampling: at the very least 100 MByte/s on "one core of a standard processor";
- reasonable RAM usage (not counting stored sampling results): below, say, 200 MB.
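For concreteness, a minimal worked example of the entropy target for a made-up 8-symbol distribution (the integer weights {12, 5, 3, 2, 1, 1, 1, 1} are arbitrary; I only picked them to land near 2.3; compile with e.g. `gcc -O2 -lm`):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical 8-symbol distribution given by integer weights (total 26),
       chosen only so that its entropy lands near the ~2.3 bits mentioned above. */
    const int w[8] = {12, 5, 3, 2, 1, 1, 1, 1};
    int total = 0;
    for (int i = 0; i < 8; i++) total += w[i];

    double H = 0.0;
    for (int i = 0; i < 8; i++) {
        double p = (double)w[i] / total;
        H -= p * log2(p);                 /* Shannon entropy in bits per sample */
    }
    printf("H                 = %.4f bits/sample\n", H);        /* ~2.34 */
    printf("within-20%% target = %.4f bits/sample\n", 1.2 * H); /* ~2.81 */
    return 0;
}
```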
I do not care about the runtime of pre-computations that need to be done once per distribution.
There are very many algorithms and implementations to choose from, and I'm having trouble comparing them in terms of the trade-off between entropy consumption, speed, and memory. I found an overview in this SO question. There are also many papers comparing algorithms (e.g., arXiv:1502.02539v6) and new algorithms being proposed (e.g., the "Fast Loaded Dice Roller", arXiv:2003.03830v2).
Knuth and Yao show that any algorithm that is optimal in terms of entropy consumption (and that spits out one sample at a time) consumes between $H$ and $H+2$ bits of entropy per sample. By drawing multiple samples at once (i.e., sampling from the product distribution), one can get closer to the Shannon limit of using $H$ bits per sample on average. This is sometimes called "batching".
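To spell this out (using only the Knuth–Yao bound above and the fact that $k$ i.i.d. samples jointly have entropy $kH$): applying an optimal sampler to the product distribution over batches of $k$ samples gives

$$ kH \le \text{expected bits per batch} < kH + 2 \quad\Longrightarrow\quad H \le \text{expected bits per sample} < H + \frac{2}{k}. $$

With $H \approx 2.3$, already $k = 3$ pushes the worst case below my 3-bit target, $k \ge 5$ gets within 20% of the Shannon limit ($1.2\,H \approx 2.76$), and for batches of $10^7$ the $2/k$ overhead is negligible.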
My first instinct would thus be to use, say, the available C implementation of the "Fast Loaded Dice Roller" after "packing" my symbols, i.e., forming a distribution over a range of integers that fit into one (or a few) bytes. However, the descriptions of these algorithms don't seem to focus on "batching". I wonder whether other methods can be more efficient (in entropy consumption, speed, or memory) by making use of my large batch sizes?
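To make the question concrete, here is a minimal sketch of the packing step I have in mind (the weights and the batch size K = 3 are assumptions for illustration; the resulting weight table would then be handed to whichever exact sampler, e.g., FLDR's preprocessing, I end up using):

```c
#include <stdio.h>
#include <stdlib.h>

#define N 8   /* base alphabet size (8 symbols, as above) */
#define K 3   /* symbols packed per draw (an assumption for this sketch) */

int main(void) {
    /* Hypothetical integer weights of the base distribution (total 26). */
    const long w[N] = {12, 5, 3, 2, 1, 1, 1, 1};

    /* Product distribution over K-tuples: N^K outcomes; the weight of the
       tuple (s_0, ..., s_{K-1}) is w[s_0] * ... * w[s_{K-1}], and the
       outcome index is the tuple read as a base-N number. */
    long n_packed = 1;
    for (int i = 0; i < K; i++) n_packed *= N;

    long *pw = malloc((size_t)n_packed * sizeof *pw);
    if (!pw) return 1;
    for (long idx = 0; idx < n_packed; idx++) {
        long weight = 1, rest = idx;
        for (int i = 0; i < K; i++) {
            weight *= w[rest % N];
            rest /= N;
        }
        pw[idx] = weight;
    }

    /* pw[] now describes an exact distribution over N^K = 512 outcomes and
       would be passed to the exact sampler of choice.  Each sampled packed
       index decodes back into K base symbols: */
    long example_idx = 123;            /* stand-in for a sampled index */
    printf("packed index %ld decodes to symbols:", example_idx);
    for (int i = 0; i < K; i++) {
        printf(" %ld", example_idx % N);
        example_idx /= N;
    }
    printf("\n");

    free(pw);
    return 0;
}
```

The 512 packed outcomes still fit in two bytes each; larger K gets closer to the Shannon limit but grows the table as N^K, which is where the 200 MB budget would start to matter.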