
While working on a different problem, I came across what seems to be an issue where base::sample(), when sampling without replacement, truncates small - though non-zero - probabilities.

Using the following code, I drew 10 values out of 40, repeated 10,000 times. (Note: with 100,000 simulations each loop takes about 3 minutes instead of under 3 seconds.)

N_simulations   = 10000
N_draw_per_sim  = 10

dat = data.frame(id             = 1:40,
                 log_likelihood = seq(from = 550, to = 350, length.out = 40))

dat$lkhd_wt = (function(x) {x / sum(x)}) (exp(dat$log_likelihood - max(dat$log_likelihood)))

# draw all N_ ids in a single call, letting sample() apply the weights sequentially
sample_base     = function(N_) {sample(x = dat$id, size = N_, prob = dat$lkhd_wt, replace = FALSE)}
sample_manual   = function(N_) {
    # manually sample once, remove drawn value, re-weight
    smpl_ = sample(x = dat$id, size = 1, prob = (function(x) {x / sum(x)}) (dat$lkhd_wt), replace = FALSE)
    for(i in 2:N_) {
        prv_idx = which(dat$id %in% smpl_)
        smpl_ = c(smpl_, sample(x = dat$id[-prv_idx], size = 1, prob = (function(x) {x / sum(x)}) (dat$lkhd_wt[-prv_idx]), replace = FALSE))
    }
    return(smpl_)
}

# 10k < 2 sec
dat_sim_draw_basic  = as.data.frame(matrix(0, nrow = dim(dat)[1], ncol = N_simulations))
for(sim_i in 1:N_simulations) {dat_sim_draw_basic[sample_base(N_draw_per_sim),    sim_i] = 1}

# 10k < 3 sec
dat_sim_draw_manual = as.data.frame(matrix(0, nrow = dim(dat)[1], ncol = N_simulations))
for(sim_i in 1:N_simulations) {dat_sim_draw_manual[sample_manual(N_draw_per_sim), sim_i] = 1}

# Calculate weights from the count
dat_agg_basic       = data.frame(id = dat$id, cnt = apply(dat_sim_draw_basic,  1, sum))
dat_agg_basic$wt    = dat_agg_basic$cnt  / sum(dat_agg_basic$cnt)

dat_agg_manual      = data.frame(id = dat$id, cnt = apply(dat_sim_draw_manual, 1, sum))
dat_agg_manual$wt   = dat_agg_manual$cnt / sum(dat_agg_manual$cnt)

# Merge for comparison
TF_non_zero = dat_agg_basic$wt != 0 | dat_agg_manual$wt != 0
dat_compare = merge(x     = dat_agg_basic[TF_non_zero,],
                    y     = dat_agg_manual[TF_non_zero,],
                    by.x  = c("id"),
                    by.y  = c("id"),
                    all.x = TRUE,
                    all.y = TRUE)

colnames(dat_compare) = c("id", "cnt_basic", "wt_basic", "cnt_manual", "wt_manual")
dat_compare = dat_compare[,c("id", "cnt_basic", "cnt_manual", "wt_basic", "wt_manual")]
dat_compare

The output (i.e. dat_compare)

   id cnt_basic cnt_manual wt_basic wt_manual
1   1     10000      10000      0.1   0.10000
2   2     10000      10000      0.1   0.10000
3   3     10000      10000      0.1   0.10000
4   4     10000      10000      0.1   0.10000
5   5     10000      10000      0.1   0.10000
6   6     10000      10000      0.1   0.10000
7   7     10000      10000      0.1   0.10000
8   8     10000      10000      0.1   0.10000
9   9     10000      10000      0.1   0.10000
10 10     10000       9958      0.1   0.09958
11 11         0         42      0.0   0.00042

This seems to indicate that the weights passed to sample() are not used in their entirety. My guess is that the smallest weights are truncated to zero and the weights are never recalculated after each draw.
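For reference, the weights themselves fall off very fast - each one is roughly 170 times smaller than the previous - so a quick look at their magnitudes:

# quick look at the magnitude of the normalized weights: by id 10 they are
# around 1e-20, and by id 40 around 1e-87
signif(dat$lkhd_wt[c(1, 2, 10, 20, 40)], 3)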

I also tried multiplying the weights by different powers of ten (10^i, for i in 1:86) and, for each power, counting how many of the simulated weights come out equal to 0.1 (with the unscaled weights, the base sample() run has 10 weights equal to 0.1, as in the table above). The resulting plot is odd, though; I would have expected it to split into distinct regions, but it jumps all over the place.

[Figure: Compare Counts As Weights Multiplied By Power]
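Roughly, the check looked like this (a simplified sketch, not the exact code I ran):

# simplified sketch of the power check: rescale the weights by 10^i, rerun a
# smaller simulation, and count how many estimated weights land exactly on 0.1
count_at_tenth = function(i, n_sim = 1000) {
    cnt = integer(nrow(dat))
    for(sim_i in 1:n_sim) {
        drawn = sample(x = dat$id, size = N_draw_per_sim, prob = dat$lkhd_wt * 10^i, replace = FALSE)
        cnt[drawn] = cnt[drawn] + 1
    }
    sum(cnt / sum(cnt) == 0.1)
}

n_at_tenth = sapply(1:86, count_at_tenth)
plot(1:86, n_at_tenth, type = "b",
     xlab = "power i (weights x 10^i)", ylab = "# weights estimated at 0.1")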

This leaves me thoroughly confused. Does anyone know why this happens, whether there is a way to force sample() to recalculate the weights, or a workaround other than writing a function that draws one value at a time and re-weights?

Ryan S
  • I confess I don't follow your logic at all. However, a very simple examination of the behaviour of `sample`, namely `hist(sapply(1:10000, function(x) sample(1:40, 10, prob=c(rep(0.999/39, 39), 0.001))), breaks=0:40, freq=FALSE)`, behaves as I would expect (and it runs in a fraction of a second). Therefore, I think you have an error somewhere in your logic. – Limey May 03 '23 at 18:41
  • The documentation for `sample()` says, "If `replace` is false, these probabilities are applied sequentially, that is the probability of choosing the next item is proportional to the weights amongst the remaining items. The number of nonzero weights must be at least size in this case." Perhaps that has some bearing on what's happening. – DaveArmstrong May 03 '23 at 18:44
  • You can just do something like `r1 <- replicate(1e5, sample_base(10)); table(r1)` to get the counts. But I think you're right, there appears to be some loss of accuracy when the weights are updated? – Axeman May 03 '23 at 19:17
  • @DaveArmstrong, doesn't that describe the same thing as OP is doing in their manual function? – Axeman May 03 '23 at 19:18

1 Answer


In brief, you are running into a common floating-point precision issue.

Your probabilities are extremely small (the smallest is roughly 1.4e-87, i.e. 86 zeros after the decimal point). From a pure probability standpoint you would be extremely unlikely to ever see those ids in 10,000 draws anyway, but weights that small are also well below what double-precision arithmetic can distinguish from zero next to the larger weights.
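To illustrate with the magnitude above: next to a weight near 1, a value that small is indistinguishable from zero in double precision.

# a weight of ~1.4e-87 disappears entirely when added to weights near 1
1 + 1.4e-87 == 1
# [1] TRUE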

Without getting into too many details, you can check your machine's floating-point limits, such as the smallest positive floating-point number and the machine epsilon. As of R 4.0 you can also check whether your machine supports the C long double type, which is longer than double:

capabilities()["long.double"]
# long.double 
#        TRUE 

My machine supports long double.

If you run .Machine you will see your machine's specifications:

.Machine$double.eps
# [1] 2.220446e-16

.Machine$longdouble.neg.eps
# [1] 5.421011e-20
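The smallest positive (normalized) double is also in .Machine, though for weights like yours it is the relative precision above, rather than this absolute limit, that matters:

.Machine$double.xmin
# [1] 2.225074e-308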

So I can resolve relative differences down to about 20 decimal places, which corresponds to roughly the 10th position in your probability vector dat$lkhd_wt.
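A quick way to see where that cutoff falls in the weight vector:

# first id whose normalized weight drops below the long-double epsilon;
# with the dat from the question this is right around id 10
which(dat$lkhd_wt < .Machine$longdouble.neg.eps)[1]
# [1] 10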

Stealing a bit of code from Limey to quickly see how this plays out:

hist(sapply(1:1e4, function(x) sample(1:40, 10, prob = dat$lkhd_wt)),
     breaks = 0:40, freq = FALSE)

[Histogram of the sampled ids - only the first few ids are ever drawn]

jpsmith
  • Dang it, I keep forgetting about the floating point issue. This was it. When I change the range of weights so that the smallest is above the .Machine$double.eps value, both sample functions look a lot more similar. I also changed the power check so that, instead of just multiplying by 10^i, I capped the individual weights at no more than 1. So when they're normalized, the larger the power (i), the greater the number of weights above the .Machine$double.eps value. This makes a (fairly) smoothly increasing graph (i.e. more ids drawn at least once as the power increases). Thank you. – Ryan S May 03 '23 at 20:00
  • @RyanS Glad it helped - I also often work with small probabilities so glad someone can commiserate! – jpsmith May 03 '23 at 20:06