
I have a dataset including 60 predictors and a dependent variable that indicates whether a purchase has taken place and how much was spent. The conversion rate in my data is 3.5%, and I want to downsample it to 2.5% by excluding records with a purchase. The original distributions should be preserved.

Thank you for your help! bjoern.

  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. But if you are asking for recommendations for methods for downsampling, also consider asking at [stats.se], where questions about statistical methods are on topic. It's unclear how you want to preserve the original distribution but change a key statistic (the proportion) in that distribution at the same time. – MrFlick Jul 28 '20 at 16:11

1 Answer


First, some simpler data (2 columns instead of 60) with 3.5% TRUE values in column b:

library(tidyverse)
n <- 10000

df <- data.frame(
  a = rnorm(n)) %>%
  mutate(b = row_number() <= .035*n)

df %>%
  summarize(mean(b))

  mean(b)
1   0.035

One way to downsample is to rbind all of the rows where b is FALSE, which you want to keep in full, with a sample of the rows where b is TRUE, reduced by a target fraction via sample_frac:

df2 <- rbind(
  df %>% filter(!b),
  df %>% filter(b) %>% sample_frac(.025/.035)
)

df2 %>%
  summarize(mean(b))

     mean(b)
1 0.02525253

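You won't get exactly 2.5% this way. Dropping TRUE rows also shrinks the total row count (here 250 of the remaining 9,900 rows are TRUE, about 2.525%), and depending on the original size of your data the sample size has to be rounded to a whole number of rows.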
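If the target rate needs to be hit more closely, one option is to solve k / (k + n_false) = target for the number k of TRUE rows to keep, rather than scaling by .025/.035. The following is a minimal sketch in base R, assuming the same toy data as above; the names target, n_false, k and df3 are illustrative and not part of the original answer:

```r
set.seed(1)                        # only so the sketch is repeatable
n  <- 10000
df <- data.frame(a = rnorm(n))
df$b <- seq_len(n) <= 0.035 * n    # first 3.5% of rows are TRUE, as above

target  <- 0.025                   # desired share of TRUE rows
n_false <- sum(!df$b)              # FALSE rows are all kept

# Solve k / (k + n_false) = target for k, then round to a whole row count
k <- round(target * n_false / (1 - target))

# Keep every FALSE row plus a random sample of k TRUE rows
keep <- c(which(!df$b), sample(which(df$b), k))
df3  <- df[keep, , drop = FALSE]

mean(df3$b)                        # close to target, up to rounding of k
```

Because the TRUE rows are still drawn uniformly at random, this changes only how many of them are kept, not which ones are eligible.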

Walker Harrison