How to iteratively take a random sample from an R data.table until the number of distinct column values equals the sample size?

Question

I have an inventory data.table that looks like this:

set.seed(5)
library(data.table)

# reproducible example data (columns are named with `=`, so no setnames() is needed)
invntry <- data.table(
  warehouse = sample(c("NY", "NJ"), 1000, replace = TRUE),
  intid = c(rep(1,150), rep(2,100), rep(3,210), rep(4,50), rep(5,80), rep(6,70), rep(7,140), rep(8,90), rep(9,90), rep(10,20)),
  placement = c(1:150, 1:100, 1:210, 1:50, 1:80, 1:70, 1:140, 1:90, 1:90, 1:20),
  container = sample(1:100, 1000, replace = TRUE),
  inventory = c(rep(3242,150), rep(9076,100), rep(5876,210), rep(9572,50), rep(3369,80), rep(4845,70), rep(8643,140), rep(4567,90), rep(7658,90), rep(1211,20)),
  stock = c(rep(3200,150), rep(10000,100), rep(6656,210), rep(9871,50), rep(3443,80), rep(5321,70), rep(8659,140), rep(4567,90), rep(7650,90), rep(1298,20)),
  risk = runif(100)
)

# running ticket number within each item and warehouse; NJ rows get ticket 0
invntry[ , ticket := 1:.N, by=c("intid", "warehouse")]
invntry[warehouse == "NJ", ticket := 0L]

# make some neighbouring rows share a container, so duplicate containers can occur
invntry$container[27:32] <- 6
invntry$container[790:810] <- 71
invntry[790:820,]

The actual data has more variables, which I want to use to compare rows of the same item (intid) that are in different containers. I would like to run multiple trials over a range of sample sizes n for each item: keep randomly selecting rows of the item until I have n rows from different containers, keeping any duplicate-container rows that were drawn along the way. For example, with a sample size of 6 for item 8, it might take 7 draws to reach 6 distinct containers:

    warehouse intid placement container inventory stock       risk ticket
21:        NY     8        10        71      4567  4567 0.38404806      5
22:        NY     8        11        96      4567  4567 0.64665968      6
23:        NJ     8        12        15      4567  4567 0.68265602      0
24:        NY     8        13        19      4567  4567 0.84437586      7
21:        NY     8        10        71      4567  4567 0.38404806      5
26:        NY     8        15        34      4567  4567 0.69580270      8
28:        NY     8        17        78      4567  4567 0.25352370      9

I searched this site but couldn't find anything that covers the above and also lets me compute values from each trial's rows for each sample size, so I think I have to use a for loop in order to distinguish each trial for each sample size. To summarize, I have two goals:

1. For each intid, sample rows at random until n unique containers have been selected, keeping every row drawn along the way (including duplicate containers).
2. Be able to do calculations on the variables for each trial, for each sample size, for each item.

Any ideas?

*It doesn't have to involve data.table; that's just how I got started.

(I think it's essentially the basic probability example of continuing to draw marbles from an urn until your sample contains a given number of different colors, but even realizing that didn't help me find a solution!)

score 1 · Answer 1 · edited May 23 '17 at 10:31

I'm not positive, but isn't this equivalent to grouping by intid and then sampling n values with replacement, where n is some integer? If so, then here's a way to do that using tidyverse functions. The code below groups by intid and samples 6 through 10 values with replacement from each group. The column Sample_Size identifies each n-sample group for each intid:

library(tidyverse)

invntry.sampled = map_df(setNames(6:10, 6:10), 
                         ~ invntry %>% 
                           group_by(intid) %>% 
                           sample_n(.x, replace=TRUE),
                         .id="Sample_Size")

And here's a data.table approach, using code adapted from this SO answer. I've wrapped the data.table code in lapply to cycle through the different sample sizes, as my data.table skills are limited. There may be a way to do this within the data.table code itself.

invntry.sampled = do.call(rbind,
                          lapply(6:10, function(n) invntry[ , .SD[sample(.N, n, replace=TRUE)], by=intid]))
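
As a rough sketch of how the data.table version might also carry a Sample_Size label, `rbindlist()` can take the names of the list as an id column via its `idcol` argument. The small `toy` table below is a made-up stand-in for `invntry` (an assumption, only `intid` and `container` are kept), and `check` is a hypothetical helper table that counts how many distinct containers each group ended up with:

```r
library(data.table)

# Toy stand-in for invntry (assumption: only intid and container matter here)
set.seed(1)
toy <- data.table(
  intid     = rep(1:2, each = 20),
  container = sample(1:10, 40, replace = TRUE)
)

sizes <- 6:10

# Same sampling as above; the list names become the Sample_Size column
sampled <- rbindlist(
  lapply(setNames(sizes, sizes), function(n) {
    toy[, .SD[sample(.N, n, replace = TRUE)], by = intid]
  }),
  idcol = "Sample_Size"
)

# Rows drawn and distinct containers per sample size and item
check <- sampled[, .(rows = .N, distinct = uniqueN(container)),
                 by = .(Sample_Size, intid)]
print(check)
```

Because the rows are drawn with replacement, `distinct` can come out smaller than `rows`, so it is worth inspecting `check` before relying on each group holding n different containers.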

Thanks for the fast reply. It's close to what I'm trying for, except that I'm looking to sample n distinct `container` values for each `intid`, and to keep any duplicate `container` rows selected before the n distinct values are reached. When I check the first bit of code, `container` 81 is sampled 2 times for `intid` 1 and both rows are part of the sample size 6. In that case, I'd like to keep selecting until 6 distinct `container` values are reached for `intid==1`. - usr342678, Apr 11 '17 at 04:53
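
The behavior described in this comment could be sketched roughly as follows: draw one row at a time, with replacement, from a single item's rows and stop as soon as n distinct containers have been seen, keeping every row drawn. This is only a minimal sketch under stated assumptions, not a tested solution for the real data: `toy` is a made-up stand-in for `invntry`, and `draw_until_distinct`, `grid`, `draws` and `per_trial` are hypothetical names introduced for illustration.

```r
library(data.table)

# Toy stand-in for invntry (assumption: only these columns matter here)
set.seed(5)
toy <- data.table(
  intid     = rep(1:3, times = c(40, 30, 20)),
  container = sample(1:15, 90, replace = TRUE),
  risk      = runif(90)
)

# Hypothetical helper: draw single rows with replacement until the drawn rows
# cover n distinct values of `col`; duplicates drawn on the way are kept.
draw_until_distinct <- function(dt, n, col = "container") {
  # Guard: without enough distinct values the loop could never finish
  stopifnot(length(unique(dt[[col]])) >= n)
  picked <- integer(0)
  while (length(unique(dt[[col]][picked])) < n) {
    picked <- c(picked, sample.int(nrow(dt), 1L))
  }
  dt[picked, ]
}

# Every combination of item, sample size and trial number
grid <- CJ(intid = unique(toy$intid), n = 3:5, trial = 1:4)

# One row per draw, labelled with its sample size and trial
draws <- rbindlist(lapply(seq_len(nrow(grid)), function(i) {
  g <- grid[i, ]
  s <- draw_until_distinct(toy[intid == g$intid], g$n)
  s[, n := g$n]
  s[, trial := g$trial]
  s
}))

# Per-trial calculations, e.g. number of draws needed and mean risk
per_trial <- draws[, .(n_draws = .N,
                       distinct = uniqueN(container),
                       mean_risk = mean(risk)),
                   by = .(intid, n, trial)]
print(per_trial)
```

Since each trial keeps its `intid`, `n` and `trial` labels, any further per-trial calculation can be done with an ordinary grouped `by`, which may remove the need for an explicit for loop. Growing `picked` one element at a time is slow for large n, so this is meant as an illustration of the stopping rule rather than an optimized implementation.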
