1

I have the same question as Resample with replacement by cluster, i.e. I want to do cluster bootstrapping. The best answer's approach to that question using rbindlist(lapply(resampled_ids, function(resampled_id) df[df$id == resampled_id,])) works, but because I have a big dataset, this resampling step is rather slow. My question is, is it possible to speed this up?

elbord77
  • 311
  • 1
  • 2
  • 11

2 Answers2

1

Use sequence to index. Demonstrated with a larger data.frame:

df <- data.frame(id = rep.int(1:1e2, sample(100:200, 1e2, replace = TRUE))[1:1e4], X = rnorm(1e4))
resampled_ids <- sample(unique(df$id), replace = TRUE)

idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
s <- data.frame(id = df$id[idx], X = df$X[idx])

Benchmarking against the rbindlist solution:

library(data.table)
library(microbenchmark)

microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
               sequence = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
                           data.frame(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#>       expr    min      lq      mean   median       uq     max neval
#>  rbindlist 9480.4 9921.95 11470.567 10431.05 12555.35 31178.2   100
#>   sequence  406.7  444.55   564.873   498.10   545.70  2818.4   100

Note that creating a new data.frame from indexed vectors is much faster than row-indexing the original data.frame. The difference is much less pronounced if a data.table is used, but, surprisingly, the rbindlist solution becomes even slower:

microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
               sequence1 = df[sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids]),],
               sequence2 = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
                            data.frame(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#>       expr    min     lq      mean   median       uq     max neval
#>  rbindlist 9431.9 9957.7 11101.545 10508.15 12395.25 15363.3   100
#>  sequence1 4284.5 4550.3  4866.891  4674.80  5009.90  8350.1   100
#>  sequence2  414.1  455.6   541.590   508.40   551.40  2881.1   100

    setDT(df)
    
microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
               sequence1 = df[sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids]),],
               sequence2 = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
                            data.table(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#>       expr     min       lq      mean   median      uq     max neval
#>  rbindlist 14877.4 15878.30 17181.572 16348.50 18527.6 22520.9   100
#>  sequence1   795.0  1016.80  1187.266  1101.95  1326.7  2566.5   100
#>  sequence2   386.4   441.75   556.226   473.70   500.9  3373.6   100

Update

To address the comment from jay.sf:

lens <- tabulate(df$id)[resampled_ids]
idx <- sequence(lens, match(unique(df$id), df$id)[resampled_ids])
s <- data.frame(cluster = rep.int(seq_along(resampled_ids), lens), id = df$id[idx], X = df$X[idx])

cluster corresponds to the index of resampled_ids.

jblood94
  • 10,340
  • 1
  • 10
  • 15
  • Same problem as with _@f-privé_'s [answer](https://stackoverflow.com/a/70889582/6574038), cluster with same ID can't be distinguished after bootstrap, this might be essential. – jay.sf Jan 28 '22 at 15:28
  • @jay.sf, It wasn't mentioned by the OP, but I can see how it could be important. Fortunately, including the `resampled_ids` in the `data.frame` is straightforward. See the update. – jblood94 Jan 28 '22 at 15:43
  • You nailed it, looks great now, upvoted! – jay.sf Jan 28 '22 at 16:19
  • Works much better! Just want to note that the ID must be numeric or factor for this, and if you have more than a single `X`, you might want to go the row index route using a `data.table` unless there's a faster way. – elbord77 Jan 28 '22 at 17:19
  • Good points. `tabulate` takes only numeric or factor, but easy to convert to numeric: `uid <- unique(df$id); df$id <- match(df$id, uid)` and then back to, e.g., character: `s <- data.frame(id = uid[df$id[idx]], X = df$X[idx])`. – jblood94 Jan 28 '22 at 17:43
0
f = data.frame( id=c(1,1,2,2,2,3,3), X = rnorm(7) )

Try this:

ind_id <- split(seq_along(f$id), f$id)
samp_id <- sample(names(ind_id), replace = TRUE)
f[unlist(ind_id[samp_id]), ]
F. Privé
  • 11,423
  • 2
  • 27
  • 78
  • It's fast, but the problem is that in the result several draws of a cluster have the same ID and you can't distinguish them anymore. Try e.g. `set.seed(79) `. – jay.sf Jan 28 '22 at 09:39