0

I'm trying to draw a random sample of rows without replacement from a dataset such that the sum of a column in the sample should be strictly within a range. For the example dataset mtcars, the random sample should be such that the sum of mpg is strictly within 90-100.

A reproducible example:

data("mtcars")

random_sample <- function(dataset){
  final_mpg = 0
  while (final_mpg < 100) {
    basic_dat <- dataset %>%
      sample_n(1) %>%
      ungroup()
    total_mpg <- basic_dat %>%
      summarise(mpg = sum(mpg)) %>%
      pull(mpg)
    final_mpg <- final_mpg + total_mpg
    if (final_mpg > 90 & final_mpg < 100){
      break()
    }
    final_dat <- rbind(get0("final_dat"), get0("basic_dat"))
  }
  return(final_dat)
}

chosen_sample <- random_sample(mtcars)

But this function output samples with sum(mpg) > 100. How do I ensure that every sample it generates is strictly within that range? Any help is much appreciated.

Debbie
  • 391
  • 2
  • 18
  • 1
    It's good that you're trying to generalize this into a function, but some concerns: (1) the dataset *must* contain a column `mpg`, (2) that field must be numeric, and (3) there must be an appropriate selection of values such that one is not too large (100), and the sum of all is not too small (90). With your `while` loop, it is entirely possible to get into a situation where the sample is broken and no solution exists. – r2evans Apr 19 '20 at 03:58
  • @Debbie I have edited my answer. Check if this is working for you. – Mohanasundaram Apr 19 '20 at 05:01
  • 1
    @r2evans - I knew my attempt was far from perfect and pointers around improving them was very helpful. – Debbie Apr 19 '20 at 14:30

2 Answers2

0

This is working. because of the values of mpg, it couldn't get more than 90.

ransmpl <- function(df) { 
  s1<- df[sample(rownames(df),1),] 
  s11 <- sum(s1$mpg) 
  while(s11<100){
    rn2<- rownames(df[!(rownames(df) %in% rownames(s1)),]) 
    nr<- df[sample(rn2,1),] 
    s11 <- sum(rbind(s1,nr)$mpg) 
    if(s11>100){ 
      break() 
    } 
    s1<-rbind(s1,nr) 
  } 
  return(s1) 
  }


chosen_sample <- ransmpl(mtcars)
chosen_sample

Output

> chosen_sample
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4

> sum(chosen_sample$mpg)
[1] 95.1
Mohanasundaram
  • 2,889
  • 1
  • 8
  • 18
  • Thanks @Mohanasundaram. Although your answer works, it throws a bunch of warnings as you are trying to compare two objects of unequal length. The below edits should fix it: `ransmpl <- function(df) { s1<- df[sample(rownames(df),1),] s11 <- sum(s1$mpg) while(s11<100){ rn2<- rownames(df[!(rownames(df) %in% rownames(s1)),]) nr<- df[sample(rn2,1),] s11 <- sum(rbind(s1,nr)$mpg) if(s11>100){ break() } s1<-rbind(s1,nr) } return(s1) }` – Debbie Apr 19 '20 at 13:50
  • Thank you @Debbie. This works better. I have edited my answer. – Mohanasundaram Apr 19 '20 at 19:38
0

Here's a hack, though realize that there's never a guarantee that it'll work.

#' Random sampling of data
#'
#' Return a sample of the dataset's rows where the sum of 'fld' values
#' is between the two numbers of 'sumbetween'.
#'
#' @param dat data.frame
#' @param fld character, the name of one of the fields in 'dat'
#' @param sumbetween numeric, length 2, the two ends of the range of
#'   desired sum
#' @param suggestn integer, a suggestion for 'n' around which sample
#'   sizes are based; the actual samples attempted will vary between
#'   0.5 and 1.5 times this value; if 'NA' (the default), then it
#'   defaults naively to 'mean(sumbetween) / median(dat[[fld]])'
#' @param iters integer, number of samples to attempt before
#'   "giving up" (otherwise this might run forever)
#' @return data.frame, a sample of the original dataset; regardless of
#'   success, two attributes are included, 'mu' and 'sigma',
#'   indicating the mean and standard deviation of the samples tested
random_sample <- function(dat, fld, sumbetween, suggestn = NA, iters = 100) {
  stopifnot(fld %in% names(dat), is.numeric(dat[[fld]]), is.numeric(sumbetween))

  if (is.na(suggestn)) {
    suggestn <- mean(sumbetween) / median(dat[[fld]])
  }
  suggestn <- min(suggestn, nrow(dat))

  mu <- NA
  Sn <- 0
  ind <- FALSE
  n <- 0L

  while ((is.na(iters) || n < iters) && !ind) {
    n <- n + 1L
    size <- min(nrow(dat), sample(seq(max(1, floor(suggestn/2)), ceiling(suggestn*1.5)), size = 1))
    rows <- sample(nrow(dat), size = size)
    s <- sum(dat[[fld]][rows])
    ind <- sumbetween[1] <= s & s <= sumbetween[2]
    # incremental mean and almost-variance of the samples
    # http://datagenetics.com/blog/november22017/index.html
    lastmu <- mu
    mu <- sum(s, (n-1)*mu, na.rm = TRUE)/n
    Sn <- Sn + sum(s, -lastmu, na.rm = TRUE)*sum(s, -mu, na.rm = TRUE)
  }

  out <- if (ind) dat[rows,] else NA
  if (!ind) warning("unable to find a successful sample after ", n, " iterations")
  # actual mean and variance of samples, successful or not
  attr(out, "mu") <- mu
  attr(out, "sigma") <- sqrt(Sn / n)
  return(out)
}

And its use is below. I use str here to demonstrate one feature: the addition of the all tested samples' means and deviations as attributes. If success, the attributes are not shown (print.data.frame by default shows no attributes), but if it fails then a warning will be given, and NA returned with the same attributes.

set.seed(42)
str(random_sample(mtcars, "mpg", c(90,100), iters=20))
# Warning in random_sample(mtcars, "mpg", c(90, 100), iters = 20) :
#   unable to find a successful sample after 20 iterations
#  logi NA
#  - attr(*, "mu")= num 106
#  - attr(*, "sigma")= num 37.9
str(random_sample(mtcars, "mpg", c(90,100), iters=20))
# 'data.frame': 5 obs. of  12 variables:
#  $ mpg : num  33.9 14.3 14.7 18.1 17.3
#  $ cyl : num  4 8 8 6 8
#  $ disp: num  71.1 360 440 225 275.8
#  $ hp  : num  65 245 230 105 180
#  $ drat: num  4.22 3.21 3.23 2.76 3.07
#  $ wt  : num  1.83 3.57 5.34 3.46 3.73
#  $ qsec: num  19.9 15.8 17.4 20.2 17.6
#  $ vs  : num  1 0 0 1 0
#  $ am  : num  1 0 0 0 0
#  $ gear: num  4 3 3 3 3
#  $ carb: num  1 4 4 1 3
#  $ new1: num  75.1 368 448 231 283.8
#  - attr(*, "mu")= num 96.1
#  - attr(*, "sigma")= num 42.1

The intent of the returns mean/deviation is to help the user determine if the suggestn (suggestion for a starting sample size) is mis-placed, or if iters is just too small and we quit too early (such as when the intended range is well within mu +/- sigma).

This uses iters to prevent an infinite loop. You can disable it (off to the races!) at your own peril.

This makes no promises that a feasible solution can be found. Imagine all values are multiples of 20, and the desired range is only 10 wide. There are certainly other conditions that are heuristically difficult to "know" with certainty to know if a solution exists.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thanks for the detailed reply. I'm still digesting your answer. What is the rationale behind `suggestn <- mean(sumbetween) / median(dat[[fld]])`? – Debbie Apr 19 '20 at 14:20
  • That is a reasonable (imo) guess at a number of samples required to sum correctly. I take sample sizes of 0.5 and 1.5 times that number. As a whole (and including that heuristic), this is meant add an alternative to your iterative-build method which, while it does not need to know sample size (even a suggestion) a priori, it can get to a point where no datum will place the sum within the target range. One could also take sample sizes from "2” (literally) to 2 or 3 times `suggestn` to expand the search, though it may take more `iters` before success. – r2evans Apr 19 '20 at 15:43