
I have a dataframe:

> class(dataset)
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
> dim(dataset)
[1] 64480    39

from which I want to sample 50,000 rows:

> dataset %>% dplyr::sample_n(50000)

But it keeps giving me the error:

Error: Sample size (50000) greater than population size (1). Do you want to replace = TRUE?

Yet this, for example, works:

> dim(dataset[1] %>% dplyr::sample_n(50000))
[1] 50000     1

So why is my population size 1? Does that have something to do with grouping?

UseR10085
Boern
  • Please provide a reproducible example. dplyr outputs an error about your data, so to answer the question we need to see your data (sample of it or made-up example). – Tim Nov 10 '15 at 12:06
  • It's very likely to be about grouping as you have `"grouped_df"`. Try to ungroup it and run the same code. – AntoniosK Nov 10 '15 at 12:09
  • 2
    Yes, it probably has to do with grouping. As you can see from the output of `class(dataset)` your data is currently grouped and some groups may have too few observations to sample 50000 without replacement. Try `dataset %>% ungroup() %>% dplyr::sample_n(50000)` – talat Nov 10 '15 at 12:09
  • tried just that. Did it ! Thanks ! – Boern Nov 10 '15 at 12:10

2 Answers


Yes, it probably has to do with grouping. As you can see from the output of class(dataset), your data is currently grouped (note the grouped_df entry), and one or more groups apparently have too few rows to sample 50,000 from without replacement.

To resolve this, you can either ungroup your data before sampling:

dataset %>% ungroup() %>% sample_n(50000)

Or you can sample with replacement:

dataset %>% sample_n(50000, replace = TRUE)
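A minimal sketch of the failure and the fix, using mtcars grouped by cyl as a stand-in for the asker's data (the group sizes here are 11, 7, and 14 rows):

```r
library(dplyr)

# Stand-in data: mtcars grouped by cyl gives groups of 11, 7 and 14 rows.
grouped <- mtcars %>% group_by(cyl)

# sample_n() draws *within each group*, so asking for more rows than the
# smallest group holds fails:
# grouped %>% sample_n(20)
# Error: the smallest cyl group has only 7 rows

# Dropping the grouping first makes the whole data frame the population:
result <- grouped %>% ungroup() %>% sample_n(20)
nrow(result)  # 20
```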
talat
  • But can you specify to sample either a particular value or the maximum value of rows? It would be nice to subsample very large groups but not have to resample small groups to accomplish that. – evolvedmicrobe Jun 26 '16 at 19:25

Unfortunately, dplyr does not let you "sample down" large groups to a given size while keeping all of a group's rows when the group is smaller than that size: either you sample every group down to the smallest group's size, or you sample the smallest group with replacement to "inflate" it to a larger size. You can work around this by defining a custom sample_n-style function, though:

### Custom sampler: per group, sample min(group size, size) rows, which
### can't be done with stock dplyr. It's a modified copy of
### dplyr:::sample_n.grouped_df and relies on dplyr internals (the
### "indices" attribute and dplyr:::sample_group), so it is tied to the
### dplyr version it was written against.
sample_vals <- function(tbl, size, replace = FALSE, weight = NULL,
                        .env = parent.frame()) {
  # assert_that(is.numeric(size), length(size) == 1, size >= 0)
  weight <- substitute(weight)
  index <- attr(tbl, "indices")  # 0-based row indices, one vector per group
  sizes <- sapply(index, function(z) min(length(z), size))  # cap at group size
  sampled <- lapply(seq_along(index), function(i)
    dplyr:::sample_group(index[[i]], frac = FALSE, tbl = tbl,
                         size = sizes[i], replace = replace,
                         weight = weight, .env = .env))
  idx <- unlist(sampled) + 1  # back to 1-based indexing
  grouped_df(tbl[idx, , drop = FALSE], vars = groups(tbl))
}

sampled_data <- dataset %>%
  group_by(something) %>%
  sample_vals(size = 50000) %>%
  ungroup()
evolvedmicrobe
  • You can achieve `"Sample down" large groups to a given size or just use all of the group's data if it's a small group` if you use `sample_frac(1)` and filter: http://stackoverflow.com/questions/30950016/dplyr-sample-n-where-n-is-the-value-of-a-grouped-variable – Alex Jan 17 '17 at 01:09
  • Oh nice, but that would that force a creation of an entirely new data frame with the permutated rows before sampling? That seems very expensive. – evolvedmicrobe Jan 17 '17 at 01:22
  • i don't know about that :) – Alex Jan 17 '17 at 01:31
  • Or maybe a simpler method would be `tbl %>% mutate(id = sample(1:n())) %>% filter(id <= 10000)`, for example. – ColinTea Nov 06 '18 at 16:47
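The mutate()/filter() idea from the last comment can be written out as follows. This is a sketch only (k = 10 and the cyl grouping are illustrative): each row gets a random within-group rank, and at most k rows per group survive the filter, so small groups pass through whole without any resampling.

```r
library(dplyr)

k <- 10
capped <- mtcars %>%
  group_by(cyl) %>%
  mutate(id = sample(n())) %>%  # random permutation of 1..group size
  filter(id <= k) %>%           # keep at most k rows per group
  select(-id) %>%
  ungroup()

# Each cyl group contributes min(group size, k) rows:
table(mtcars$cyl)   # groups of 11, 7 and 14 rows
count(capped, cyl)  # 10, 7 and 10 rows respectively
```

Note that sample(n()) works here because sample() applied to a single integer returns a random permutation of 1..n, evaluated per group inside the grouped mutate().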