I would like to subsample a dataframe that has an imbalanced number of observations by factor level.

The output I want is another dataframe built from data from the original one where the number of observations by factor level is similar across factor levels (doesn't need to be exactly the same number for each level, but roughly similar).

I am not sure if this is called "thinning" the data or "undersampling" the data.

Consider for instance this dataframe:

data <- data.frame(id = 1:1000,
           class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))

How can I slice rows so that I extract ~200 rows, 50 for each class A, B, C and D?

I can do this manually, but I would like to find a method that I can use with larger datasets and based on a factor with more levels.

I would also be thankful for advice on the name of what I need (thinning? undersampling? stratified sampling?). Thanks!

Maël
Javier Fajardo

2 Answers

You can use slice_sample in dplyr:

library(dplyr)
data %>% 
  group_by(class) %>% 
  slice_sample(n = 50)

In dplyr 1.1.0 and above:

slice_sample(data, n = 50, by = class)
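
For a repeatable draw and a quick check of the result, something like the sketch below should work. The seed value and the name sampled are my own choices for illustration. As far as I know, slice_sample() without replacement silently truncates to the group size when a group has fewer than n rows, so smaller classes would simply keep all of their rows.

```r
library(dplyr)

# Example data from the question
data <- data.frame(id = 1:1000,
                   class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))

set.seed(42)  # assumption: any fixed seed, used only to make the sample repeatable

# Draw up to 50 rows per class, then drop the grouping
sampled <- data %>%
  group_by(class) %>%
  slice_sample(n = 50) %>%
  ungroup()

# Count rows per class to confirm the result is balanced
count(sampled, class)
```

The ungroup() call is optional, but it avoids carrying the grouping into later steps by accident.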

Maël

A base R option: split the data frame by class, use lapply with sample to draw 50 rows from each group, then combine the pieces back into one data frame with do.call and rbind:

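# Split by class, then draw 50 random rows from each group
df <- lapply(split(data, data$class), function(x) x[sample(nrow(x), 50), ])
# Stack the per-class samples back into one data frame
df_sampled <- do.call(rbind, df)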

# Check number of observations
library(dplyr)
df_sampled %>%
  group_by(class) %>%
  summarise(n = n())
#> # A tibble: 4 × 2
#>   class     n
#>   <chr> <int>
#> 1 A        50
#> 2 B        50
#> 3 C        50
#> 4 D        50

Created on 2023-02-17 with reprex v2.0.2
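
One caveat: sample(nrow(x), 50) throws an error if a class has fewer than 50 rows. A minimal sketch of a guarded version is below, assuming you are happy to keep every row of a class that is too small. The names n_per_class, pieces and sampled, the seed, and the min() guard are my additions, not part of the code above.

```r
# Example data from the question
data <- data.frame(id = 1:1000,
                   class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))

set.seed(1)        # assumption: fixed seed, only for repeatability
n_per_class <- 50  # hypothetical target number of rows per class

pieces <- lapply(split(data, data$class), function(x) {
  # Never ask for more rows than the group actually has
  k <- min(n_per_class, nrow(x))
  x[sample.int(nrow(x), k), , drop = FALSE]
})

# Combine the per-class samples and drop the "A.123"-style row names
sampled <- do.call(rbind, pieces)
rownames(sampled) <- NULL

# Rows per class
table(sampled$class)
```

With the example data every class has at least 50 rows, so each class ends up with 50.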

Quinten