With long-format data and a grouping variable, is there a direct way to remove all rows belonging to groups that have too few rows? Can we operate on the data frame directly, without tallying the groups first?
Here's an example adapted from here:
library(tidyverse)
set.seed(2024)
my_df <-
  iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  mutate(n = sample(1:15, size = 3)) %>%
  mutate(samp = map2(data, n, sample_n)) %>%
  select(-data) %>%
  unnest(samp)
my_df
#> # A tibble: 20 x 6
#> Species n Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 2 5.4 3.9 1.3 0.4
#> 2 setosa 2 5.1 3.8 1.9 0.4
#> 3 versicolor 5 5.5 2.4 3.7 1
#> 4 versicolor 5 5.5 2.6 4.4 1.2
#> 5 versicolor 5 6 2.7 5.1 1.6
#> 6 versicolor 5 6 2.9 4.5 1.5
#> 7 versicolor 5 5 2 3.5 1
#> 8 virginica 13 6.4 3.2 5.3 2.3
#> 9 virginica 13 6.4 2.8 5.6 2.1
#> 10 virginica 13 5.7 2.5 5 2
#> 11 virginica 13 6.3 2.8 5.1 1.5
#> 12 virginica 13 7.2 3.2 6 1.8
#> 13 virginica 13 6.8 3.2 5.9 2.3
#> 14 virginica 13 6.7 3 5.2 2.3
#> 15 virginica 13 7.9 3.8 6.4 2
#> 16 virginica 13 6.9 3.2 5.7 2.3
#> 17 virginica 13 6.3 3.3 6 2.5
#> 18 virginica 13 6.5 3.2 5.1 2
#> 19 virginica 13 6.1 2.6 5.6 1.4
#> 20 virginica 13 6.5 3 5.5 1.8
Created on 2021-03-16 by the reprex package (v0.3.0)
My question: In this example, how can we keep only groups that have more than 10 rows?
My current two-step solution, which I dislike:
groups_to_keep <-
  my_df %>%
  count(Species) %>%
  filter(n > 10) %>%
  pull(Species) %>%
  as.character()
> groups_to_keep
## [1] "virginica"
desired_output <-
  my_df %>%
  filter(Species %in% groups_to_keep)
> desired_output
## # A tibble: 13 x 6
## Species n Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 virginica 13 6.4 3.2 5.3 2.3
## 2 virginica 13 6.4 2.8 5.6 2.1
## 3 virginica 13 5.7 2.5 5 2
## 4 virginica 13 6.3 2.8 5.1 1.5
## 5 virginica 13 7.2 3.2 6 1.8
## 6 virginica 13 6.8 3.2 5.9 2.3
## 7 virginica 13 6.7 3 5.2 2.3
## 8 virginica 13 7.9 3.8 6.4 2
## 9 virginica 13 6.9 3.2 5.7 2.3
## 10 virginica 13 6.3 3.3 6 2.5
## 11 virginica 13 6.5 3.2 5.1 2
## 12 virginica 13 6.1 2.6 5.6 1.4
## 13 virginica 13 6.5 3 5.5 1.8
Is there a way to get from my_df to desired_output directly, without having to create groups_to_keep?
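For illustration, one direction that might do this in a single pipeline is a grouped filter, where dplyr's n() gives the size of the current group. The sketch below is only a minimal example on made-up data: toy_df, grp, and value are hypothetical names standing in for my_df and its columns, and the group sizes (2, 5, 13) are chosen to mirror the example above.

```r
library(dplyr)

# Hypothetical toy data standing in for my_df:
# group "a" has 2 rows, "b" has 5 rows, "c" has 13 rows.
toy_df <- tibble(
  grp   = rep(c("a", "b", "c"), times = c(2, 5, 13)),
  value = seq_len(20)
)

# Grouped filter: inside filter(), n() is the number of rows in the
# current group, so the condition keeps or drops whole groups at once.
kept <- toy_df %>%
  group_by(grp) %>%
  filter(n() > 10) %>%
  ungroup()

kept
```

Under these assumptions, only the rows of the 13-row group remain, and no intermediate vector of group names is needed. Applied to the data in the question, the same pattern would presumably read my_df %>% group_by(Species) %>% filter(n() > 10) %>% ungroup().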