
I have a dataset, espana2015, of a country with schools, students, etc. I want to remove schools with fewer than 20 students. The school identifier variable is CNTSCHID.

dim(espana2015)
[1] 6736  106

The only way I know is long, manual, and inefficient: writing out the schools one by one. Here there are only 13 schools with fewer than 20 students, but what if there were many more, e.g. more than 100?

espana2015 %>% group_by(CNTSCHID) %>% summarise(students = n()) %>%
  filter(students < 20) %>% select(CNTSCHID) -> removeSch

removeSch
# A tibble: 13 x 1
   CNTSCHID
      <dbl>
 1 72400046
 2 72400113
 3 72400261
 4 72400314
 5 72400396
 6 72400472
 7 72400641
 8 72400700
 9 72400711
10 72400736
11 72400909
12 72400927
13 72400979

espana2015 %>% subset(!CNTSCHID %in% c(72400046,72400113,72400261,
                                      72400314,72400396,72400472,
                                      72400641,72400700,72400711,
                                      72400736,72400909,72400927,
                                      72400979)) -> new_espana2015

Please help me do this better. Walter

  • (1) Why are you using `subset` in a dplyr pipe? Just use `dplyr::filter`. (2) Welcome to SO! Your question is a great first-question (sample code), but it lacks sample data for us to play with, please provide the output from `dput(x)` where `x` is a representative sample of data: no more rows/columns than we need, enough variability of group variables. See https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Dec 21 '21 at 14:20
  • My *guess*, though: `espana2015 %>% group_by(CNTSCHID) %>% filter(n() >= 20)`. – r2evans Dec 21 '21 at 14:21
  • Does this answer your question? [Subset data frame based on number of rows per group](https://stackoverflow.com/questions/20204257/subset-data-frame-based-on-number-of-rows-per-group) – camille Dec 21 '21 at 16:31
  • Also https://stackoverflow.com/q/24503279/5325862 and https://stackoverflow.com/q/18302610/5325862, which each link back to many more related posts – camille Dec 21 '21 at 16:40

1 Answer


Lacking sample data, I'll demonstrate on mtcars, where my cyl is your CNTSCHID.

library(dplyr)
table(mtcars$cyl)
#  4  6  8 
# 11  7 14 

mtcars %>%
  group_by(cyl) %>%
  filter(n() > 10) %>%
  ungroup()
# # A tibble: 25 x 11
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#  2  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#  3  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#  4  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#  5  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#  6  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#  7  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#  8  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
#  9  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
# 10  10.4     8  460    215  3     5.42  17.8     0     0     3     4
# # ... with 15 more rows

This works because the condition inside filter evaluates to a single logical value per group, and that length-1 TRUE/FALSE is then recycled to every row in the group. For cyl == 4, (n() > 10) --> (11 > 10) --> TRUE, so the call is effectively %>% filter(TRUE) for that group and all of its rows are kept; for cyl == 6, (7 > 10) is FALSE and all of its rows are dropped. The dplyr::filter function does "safe recycling" in a sense: the condition must be either the same length as the number of rows in the group or length 1. When it is length 1, it is essentially saying "all or nothing".
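Applied to the question, the same pattern might look like the sketch below. Since no sample data was posted, the toy data frame and its school IDs are made up for illustration; only the column name CNTSCHID comes from the question. Note the use of >= 20, so that schools with exactly 20 students are kept ("fewer than 20" are removed).

```r
library(dplyr)

# Hypothetical stand-in for espana2015: three invented schools with
# 25, 20 and 5 students respectively.
toy <- data.frame(
  CNTSCHID = rep(c(1001, 1002, 1003), times = c(25, 20, 5)),
  student  = seq_len(50)
)

# Keep only schools with at least 20 students. n() is evaluated once per
# group, and the single TRUE/FALSE is recycled to every row of that group.
kept <- toy %>%
  group_by(CNTSCHID) %>%
  filter(n() >= 20) %>%
  ungroup()

# To inspect the per-group decision, compute the same flag with summarise().
flags <- toy %>%
  group_by(CNTSCHID) %>%
  summarise(students = n(), keep = n() >= 20)

table(kept$CNTSCHID)
```

With the real data, replacing toy with espana2015 should be all that is needed; there is no need to build the list of school IDs by hand first.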

r2evans