My data frame `df` has a column (`category`) that contains either "Yes" or "No". To create a more balanced sample, I want to select the first 500 rows where it is "Yes" and the first 500 rows where it is "No".

I've tried this code:

top_n(df,500, category=="Yes")

But this selects ALL cases of "Yes" instead of only the first 500. I also tried the following, which gave me an error (though I'm sure it makes no sense):

df %>% filter(top_n(500, category == "Yes") & top_n(500, category=="No"))

I need a bit of help with the right direction.

2 Answers

I'd probably just use `head` for this and filter directly on the data frame:

df1 <- head(df[df$category == "Yes",], 500)
df2 <- head(df[df$category == "No",], 500)

# to combine
out <- rbind(df1, df2)

I'm guessing top_n does something similar. I expect there is a nicer way with dplyr but this should work :)

Jonny Phelps
  • This works great! Just wanted to find a way to get top_n to work or any other dplyr way. Will keep searching :) –  Jan 07 '21 at 13:05
  • `top_n` doesn't do what you think. If you run `?top_n` in the console, you'll see that it returns the top 500 rows ranked by value, and it keeps ties. This matters: if the dataset has only one distinct value, every row ties, so asking for 1 row has the same effect as asking for 5. E.g. `top_n(data.frame(x=rep(1,10)), 5, x)` returns the same rows as `top_n(data.frame(x=rep(1,10)), 10, x)` – Jonny Phelps Jan 07 '21 at 13:21
  • There is a `dplyr` counterpart to `head`: `slice_head()` (or `slice(1:n)`); `glimpse` only prints a compact preview. See e.g. https://stackoverflow.com/questions/23408510/head-function-in-r-package-dplyr – Jonny Phelps Jan 07 '21 at 13:21
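
For the `dplyr` route asked about in the comments, here is a minimal sketch. It assumes dplyr 1.0.0 or later (where `slice_head()` is available); the toy data frame and the `id` column are made up for illustration, and `n` is set to 3 instead of 500 to keep the output small.

```r
library(dplyr)

# Toy data (hypothetical): alternating "Yes"/"No", with an id to track row order
df <- data.frame(
  id       = 1:20,
  category = rep(c("Yes", "No"), times = 10)
)

n <- 3  # rows wanted per category (500 in the question)

out <- df %>%
  group_by(category) %>%   # treat each category separately
  slice_head(n = n) %>%    # keep the first n rows within each group
  ungroup()

out
```

If the rows should stay in their original overall order, `df %>% group_by(category) %>% filter(row_number() <= n)` should give the same rows without regrouping them.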

If you want to randomly select yes/no answers you can use this code:

#// generate toy data
df <- data.frame(YN = rep(c("yes", "no"),10), val = runif(20, 1, 100))
head(df)
#>    YN      val
#> 1 yes 26.00628
#> 2  no 98.34237
#> 3 yes 68.05788
#> 4  no 21.87011
#> 5 yes 33.92545
#> 6  no 68.74417

#// set random seed for reproducibility
set.seed(123)

#// randomly sample 5 'yes' answers
yes <- df[sample(which(df$YN == "yes"), 5),]
#// randomly sample 5 'no' answers
no <-  df[sample(which(df$YN == "no"), 5),]

#// create new dataframe with sampled answers
df_sub <- rbind(yes, no)
df_sub
#>     YN       val
#> 5  yes 33.925453
#> 19 yes 53.548253
#> 3  yes 68.057878
#> 15 yes 51.029700
#> 11 yes 91.768337
#> 10  no 11.923457
#> 8   no  8.467184
#> 12  no 63.233610
#> 16  no 93.375332
#> 2   no 98.342369

Created on 2021-01-07 by the reprex package (v0.3.0)

Mario Niepel
  • Thank you! But I wanted to create a "balanced" sample based on the first 500 observations of "yes" and "no", so not random, but thanks anyway for the nice example! –  Jan 07 '21 at 13:05
  • The balance comes in by sampling identical numbers for yes and no (the number 5 in the `sample` call in the example -- you would choose 500). If there is a bias in how your answers are arranged, then by taking the first answers that match your search criteria you may carry that bias into your subset and fail to achieve a balanced set. But you surely know your data better than anybody. – Mario Niepel Jan 07 '21 at 13:08
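
To make the ordering concern concrete, here is a rough base R sketch that takes either the first `n` rows or a random `n` rows per category. The data are invented so that `val` rises with row position, and `balanced_subset` is a hypothetical helper name, not something from either answer.

```r
# Toy data (hypothetical): 'val' rises with row order, so early rows differ from late ones
set.seed(42)
df <- data.frame(
  category = rep(c("Yes", "No"), times = 50),
  val      = seq_len(100)
)

n <- 5  # rows wanted per category (500 in the question)

# Hypothetical helper: n rows per category, either the first n or a random n
balanced_subset <- function(data, n, random = FALSE) {
  parts <- lapply(split(data, data$category), function(g) {
    k <- min(n, nrow(g))  # never ask for more rows than the group has
    if (random) g[sample.int(nrow(g), k), , drop = FALSE] else head(g, k)
  })
  do.call(rbind, parts)
}

first_n  <- balanced_subset(df, n)
random_n <- balanced_subset(df, n, random = TRUE)

table(first_n$category)  # 5 rows of each category
mean(first_n$val)        # low, because only the earliest rows were taken
mean(random_n$val)       # drawn from the whole range of rows
```

Both subsets are balanced in the sense of equal counts per category; they differ only in which rows represent each category, which is where an ordering bias could enter.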