I have a large dataset containing multiple groups that I want to sample. Each group has some positive cases (response value 1) and many more negative cases (response value 0).

For each group, I want to select all the positive cases, plus a random sample of negative cases equal to 4 times the number of positive cases in that group.

I also need something that runs quickly on a lot of data.

Semi-Update:

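library(dplyr)

stratified_sample = data %>%
  group_by(group) %>%
  # Target number of negatives per group: 4x the number of positives
  mutate(n_pos = sum(response == 1),
         n_neg = 4 * n_pos) %>%
  group_by(group, response) %>%
  # Assign a random order within each group/response combination
  mutate(rec_num = n(),
         random_val = runif(n()),
         random_order = rank(random_val)) %>%
  # Keep every positive, plus the first n_neg negatives in random order
  filter(response == 1 | random_order <= n_neg)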
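A quick way to check this approach is to run it on toy data and count the rows kept per group. The sketch below is only an illustration: the `toy` data frame, its group sizes, and the seed are made up, and it assumes every group has at least 4 times as many negatives as positives. It uses `row_number()` on random values in place of `rank()`, which should give the same kind of random ordering.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical toy data: 3 groups of 1000 rows, roughly 5% positives each
toy <- tibble(
  group    = rep(c("a", "b", "c"), each = 1000),
  response = rbinom(3000, size = 1, prob = 0.05)
)

sampled <- toy %>%
  group_by(group) %>%
  mutate(n_neg = 4 * sum(response == 1)) %>%
  group_by(group, response) %>%
  # row_number() on random values yields a random ordering within each subgroup
  mutate(random_order = row_number(runif(n()))) %>%
  ungroup() %>%
  filter(response == 1 | random_order <= n_neg)

# Count kept rows per group; negatives should be exactly 4x positives
check <- sampled %>%
  group_by(group) %>%
  summarise(n_pos = sum(response == 1),
            n_neg = sum(response == 0))
print(check)
```

If a group has fewer negatives than 4 times its positives, the filter simply keeps all of that group's negatives.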
Nate Thompson
  • Including a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in your question will increase your chances of getting an answer. – Samuel Nov 09 '17 at 20:32

1 Answer

This should work if you sub in the correct names. If you have issues, provide a reproducible example.

library(dplyr)

stratified_sample = your_large_dataset %>%
    group_by(whatever_your_grouping_variable_is) %>%
    mutate(n_pos = sum(column_name_of_your_label == 1),
           n_neg = sum(column_name_of_your_label == 0),
           cutoff = 4 * n_pos / n_neg) %>%
    filter(column_name_of_your_label == 1 | runif(n()) < cutoff)

This gives each negative case a selection probability of 4 * (number of positive cases) / (number of negative cases) within its group. The number of negatives sampled won't be exactly 4 times the number of positives, but that is its expected value.
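To see what "not exact" means in practice, the sketch below applies the same probabilistic filter repeatedly to a single hypothetical group and records how many negatives are kept each time. The group sizes (50 positives, 950 negatives), the seed, and the helper name `count_negatives` are invented for illustration. The counts should scatter around 4 * 50 = 200 rather than equal it on every run.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical single group: 50 positives and 950 negatives
one_group <- tibble(response = c(rep(1, 50), rep(0, 950)))

# Apply the probabilistic filter once and return the number of negatives kept
count_negatives <- function(df) {
  cutoff <- 4 * sum(df$response == 1) / sum(df$response == 0)
  kept <- df %>% filter(response == 1 | runif(n()) < cutoff)
  sum(kept$response == 0)
}

# Repeat the sampling 200 times to see how much the count varies
neg_counts <- replicate(200, count_negatives(one_group))
print(summary(neg_counts))
```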

Gregor Thomas
  • You got me really close, but because of how the cutoffs work, it sometimes gives exactly 4 times the number of positive cases and sometimes more, depending on how the random values shake out. I posted a "Semi-Update" with code that I got to work. Go ahead and change your answer to that, or modify yours, for the answer credit. – Nate Thompson Nov 10 '17 at 13:27