While looking into a way to solve a different problem, I came across what seems to be an issue where base::sample(), when sampling without replacement, truncates small (though non-zero) probabilities.
Using the following code, I drew 10 values out of 40, 10,000 times, comparing base::sample() against a manual one-at-a-time approach. (Note: with 100,000 simulations each approach takes about 3 minutes instead of under 3 seconds.)
N_simulations = 10000
N_draw_per_sim = 10

dat = data.frame(id = 1:40,
                 log_likelihood = seq(from = 550, to = 350, length.out = 40))
# convert log-likelihoods to normalized weights (shift by the max for numerical stability)
dat$lkhd_wt = (function(x) {x / sum(x)}) (exp(dat$log_likelihood - max(dat$log_likelihood)))

# draw all N_ values in a single call, letting sample() handle the weights
sample_base = function(N_) {sample(x = dat$id, size = N_, prob = dat$lkhd_wt, replace = FALSE)}

sample_manual = function(N_) {
  # manually sample once, remove drawn value, re-weight, repeat (assumes N_ >= 2)
  smpl_ = sample(x = dat$id, size = 1, prob = (function(x) {x / sum(x)}) (dat$lkhd_wt), replace = FALSE)
  for(i in 2:N_) {
    prv_idx = which(dat$id %in% smpl_)
    smpl_ = c(smpl_, sample(x = dat$id[-prv_idx], size = 1,
                            prob = (function(x) {x / sum(x)}) (dat$lkhd_wt[-prv_idx]),
                            replace = FALSE))
  }
  return(smpl_)
}
# basic approach: 10,000 simulations in < 2 sec
dat_sim_draw_basic = as.data.frame(matrix(0, nrow = nrow(dat), ncol = N_simulations))
for(sim_i in 1:N_simulations) {dat_sim_draw_basic[sample_base(N_draw_per_sim), sim_i] = 1}

# manual approach: 10,000 simulations in < 3 sec
dat_sim_draw_manual = as.data.frame(matrix(0, nrow = nrow(dat), ncol = N_simulations))
for(sim_i in 1:N_simulations) {dat_sim_draw_manual[sample_manual(N_draw_per_sim), sim_i] = 1}
# Calculate weights from the count
dat_agg_basic = data.frame(id = dat$id, cnt = rowSums(dat_sim_draw_basic))
dat_agg_basic$wt = dat_agg_basic$cnt / sum(dat_agg_basic$cnt)
dat_agg_manual = data.frame(id = dat$id, cnt = rowSums(dat_sim_draw_manual))
dat_agg_manual$wt = dat_agg_manual$cnt / sum(dat_agg_manual$cnt)
# Merge for comparison
TF_non_zero = dat_agg_basic$wt != 0 | dat_agg_manual$wt != 0
dat_compare = merge(x = dat_agg_basic[TF_non_zero, ],
                    y = dat_agg_manual[TF_non_zero, ],
                    by = "id", all = TRUE)
colnames(dat_compare) = c("id", "cnt_basic", "wt_basic", "cnt_manual", "wt_manual")
dat_compare = dat_compare[,c("id", "cnt_basic", "cnt_manual", "wt_basic", "wt_manual")]
dat_compare
The output (i.e. dat_compare):
id cnt_basic cnt_manual wt_basic wt_manual
1 1 10000 10000 0.1 0.10000
2 2 10000 10000 0.1 0.10000
3 3 10000 10000 0.1 0.10000
4 4 10000 10000 0.1 0.10000
5 5 10000 10000 0.1 0.10000
6 6 10000 10000 0.1 0.10000
7 7 10000 10000 0.1 0.10000
8 8 10000 10000 0.1 0.10000
9 9 10000 10000 0.1 0.10000
10 10 10000 9958 0.1 0.09958
11 11 0 42 0.0 0.00042
This seems to indicate that the weights passed to sample() are not used in their entirety. My guess is that the smallest weights are truncated to zero and never recalculated after each draw.
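For scale (a quick check added here for context, not part of the simulation above): consecutive weights differ by a constant factor of exp(-200/39), roughly 0.0059, so id 11 holds about 5e-23 of the total mass, far below the relative resolution of a double; yet re-weighting after each removal should still hand id 11 the final draw about 0.6% of the time, the same order of magnitude as the 42 counts above.

r = exp(-200 / 39)     # ratio between consecutive weights, ~0.0059
dat$lkhd_wt[11]        # ~5e-23: id 11's share of the total mass
.Machine$double.eps    # ~2.2e-16: relative resolution of a double
r / sum(r^(0:30))      # ~0.0059: chance the 10th manual draw picks id 11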
I also tried multiplying the weights by different powers of ten (10^i, for i in 1:86) and, for each scaling, counting how many ids ended up with an empirical weight of exactly 0.1 (i.e., were drawn in every simulation); with the unscaled weights, sample() gives exactly 10 such ids. However, the resulting plot is odd: I would have expected it to split into distinct regions, but it just jumps all over the place.
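In code, something like this (a minimal sketch with a smaller simulation count; sample() only needs relative weights, so the scaling should be a mathematical no-op):

# for each power of ten, rescale the weights, rerun the simulation,
# and count the ids that were drawn in every single simulation
n_at_point_one = sapply(1:86, function(i) {
  draws = replicate(1000, sample(dat$id, size = N_draw_per_sim,
                                 prob = dat$lkhd_wt * 10^i))
  wt = tabulate(draws, nbins = nrow(dat)) / length(draws)
  sum(wt == 0.1)
})
plot(1:86, n_at_point_one, type = "b",
     xlab = "i (weights scaled by 10^i)",
     ylab = "# of ids with empirical weight 0.1")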
This leaves me thoroughly confused. Does anyone know why this happens, whether there is a way to force sample() to recalculate the weights after each draw, or a way around this issue, other than writing a function that draws one value at a time and re-weights (as above)?