
I have a dataframe:

> class(dataset)
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
> dim(dataset)
[1] 64480    39

from which I want to sample 50,000 rows:

> dataset %>% dplyr::sample_n(50000)

But it keeps giving me the error:

Error: Sample size (50000) greater than population size (1). Do you want to replace = TRUE?

Yet this, for example, works:

> dim(dataset[1] %>% dplyr::sample_n(50000))
[1] 50000     1

So why is my population size 1? Does that have something to do with grouping?

UseR10085
Boern
  • Please provide a reproducible example. dplyr outputs an error about your data, so to answer the question we need to see your data (sample of it or made-up example). – Tim Nov 10 '15 at 12:06
  • It's very likely to be about grouping as you have `"grouped_df"`. Try to ungroup it and run the same code. – AntoniosK Nov 10 '15 at 12:09
  • 2
    Yes, it probably has to do with grouping. As you can see from the output of `class(dataset)` your data is currently grouped and some groups may have too few observations to sample 50000 without replacement. Try `dataset %>% ungroup() %>% dplyr::sample_n(50000)` – talat Nov 10 '15 at 12:09
  • tried just that. Did it ! Thanks ! – Boern Nov 10 '15 at 12:10

2 Answers


Yes, it probably has to do with grouping. As you can see from the output of class(dataset), your data is currently grouped (note the grouped_df entry), and one or more groups apparently have too few rows to sample 50,000 from without replacement.

To resolve this, you can either ungroup your data before sampling:

dataset %>% ungroup() %>% sample_n(50000)

Or you can sample with replacement:

dataset %>% sample_n(50000, replace = TRUE)
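A minimal sketch of the failure and the fix, using mtcars grouped by cyl as a stand-in for the asker's data (the group sizes here are 11, 7, and 14 rows):

```r
library(dplyr)

# Stand-in data: mtcars grouped by cyl gives groups of 11, 7 and 14 rows.
grouped <- mtcars %>% group_by(cyl)

# sample_n() draws *within each group*, so asking for more rows than the
# smallest group holds fails:
# grouped %>% sample_n(20)
# Error: the smallest cyl group has only 7 rows

# Dropping the grouping first makes the whole data frame the population:
result <- grouped %>% ungroup() %>% sample_n(20)
nrow(result)  # 20
```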
talat
  • But can you specify to sample either a particular value or the maximum value of rows? It would be nice to subsample very large groups but not have to resample small groups to accomplish that. – evolvedmicrobe Jun 26 '16 at 19:25

Unfortunately, dplyr does not let you "sample down" large groups to a given size while keeping all of a group's rows when the group is smaller than that size: either you sample every group down to the smallest group's size, or you sample the smallest group with replacement to "inflate" it to a larger size. You can work around this by defining a custom sample_n-style function, though:

### Custom sampler: per group, sample min(group size, size) rows, which
### can't be done with stock dplyr. It's a modified copy of
### dplyr:::sample_n.grouped_df and relies on dplyr internals (the
### "indices" attribute and dplyr:::sample_group), so it is tied to the
### dplyr version it was written against.
sample_vals <- function(tbl, size, replace = FALSE, weight = NULL,
                        .env = parent.frame()) {
  # assert_that(is.numeric(size), length(size) == 1, size >= 0)
  weight <- substitute(weight)
  index <- attr(tbl, "indices")  # 0-based row indices, one vector per group
  sizes <- sapply(index, function(z) min(length(z), size))  # cap at group size
  sampled <- lapply(seq_along(index), function(i)
    dplyr:::sample_group(index[[i]], frac = FALSE, tbl = tbl,
                         size = sizes[i], replace = replace,
                         weight = weight, .env = .env))
  idx <- unlist(sampled) + 1  # back to 1-based indexing
  grouped_df(tbl[idx, , drop = FALSE], vars = groups(tbl))
}

sampled_data <- dataset %>%
  group_by(something) %>%
  sample_vals(size = 50000) %>%
  ungroup()
evolvedmicrobe
  • You can achieve `"Sample down" large groups to a given size or just use all of the group's data if it's a small group` if you use `sample_frac(1)` and filter: http://stackoverflow.com/questions/30950016/dplyr-sample-n-where-n-is-the-value-of-a-grouped-variable – Alex Jan 17 '17 at 01:09
  • Oh nice, but that would that force a creation of an entirely new data frame with the permutated rows before sampling? That seems very expensive. – evolvedmicrobe Jan 17 '17 at 01:22
  • i don't know about that :) – Alex Jan 17 '17 at 01:31
  • Or maybe a simpler method would be `tbl %>% mutate(id = sample(1:n())) %>% filter(id <= 10000)`, for example. – ColinTea Nov 06 '18 at 16:47
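The mutate()/filter() idea from the last comment can be written out as follows. This is a sketch only (k = 10 and the cyl grouping are illustrative): each row gets a random within-group rank, and at most k rows per group survive the filter, so small groups pass through whole without any resampling.

```r
library(dplyr)

k <- 10
capped <- mtcars %>%
  group_by(cyl) %>%
  mutate(id = sample(n())) %>%  # random permutation of 1..group size
  filter(id <= k) %>%           # keep at most k rows per group
  select(-id) %>%
  ungroup()

# Each cyl group contributes min(group size, k) rows:
table(mtcars$cyl)   # groups of 11, 7 and 14 rows
count(capped, cyl)  # 10, 7 and 10 rows respectively
```

Note that sample(n()) works here because sample() applied to a single integer returns a random permutation of 1..n, evaluated per group inside the grouped mutate().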