I would like to subsample a dataframe that has an imbalanced number of observations by factor level.
The output I want is another dataframe, built from the original one, in which the number of observations is roughly similar across factor levels (it doesn't need to be exactly the same number for each level).
I am not sure if this is called "thinning" the data or "undersampling" the data.
Consider for instance this dataframe:
data <- data.frame(id = 1:1000,
                   class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))
How can I slice rows so that I extract ~200 rows, 50 for each of the classes A, B, C and D?
I can do this manually, but I would like to find a method that I can use with larger datasets and based on a factor with more levels.
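To make the goal concrete, here is a minimal base-R sketch of the kind of per-class draw I have in mind. It is only an illustration, not necessarily the best approach: `n_per_class`, `idx_by_class`, `keep` and `balanced` are names I made up, the seed is there only so the draw is reproducible, and I am assuming that a simple random draw without replacement within each class is acceptable. As far as I can tell, this is what is usually described as random undersampling (or downsampling) of the larger classes.

```r
# Example data from above: four classes with very different counts
data <- data.frame(id = 1:1000,
                   class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))

set.seed(42)       # assumption: fixed seed, only for reproducibility
n_per_class <- 50  # hypothetical target; here the size of the smallest class

# Group the row indices by class
idx_by_class <- split(seq_len(nrow(data)), data$class)

# From each class, draw at most n_per_class row indices without replacement.
# Indexing via sample.int() avoids the sample() pitfall when a class has one row.
keep <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), min(n_per_class, length(idx)))]
}), use.names = FALSE)

# Subset the original rows, keeping their original order
balanced <- data[sort(keep), ]

# Each of A, B, C and D now has 50 rows
table(balanced$class)
```

Because the target is capped with `min()`, a class with fewer than `n_per_class` rows is kept whole instead of raising an error, which should matter with more levels and a long tail of rare ones.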
I would also be thankful for advice on the name of what I need (thinning? undersampling? stratified sampling?). Thanks!