Subset data frame based on number of rows that meet a condition

Question

Is there a way to subset my data frame if there are >3 rows of one observation? E.g., my data looks like:

ID    Food        Vegetable 
aaa   fruit       lemon 
bbb   fruit       lemon
ccc   fruit       sprout
ddd   fruit       lemon
eee   fruit       lemon
fff   fruit       watermelon

and I'd like it to look like:

ID    Food        Vegetable 
aaa   fruit       lemon 
bbb   fruit       lemon
ddd   fruit       lemon
eee   fruit       lemon

Structure:

structure(list(
    ID = c("aaa", "bbb", "ccc", "ddd", "eee", "fff"),
    Food = c("fruit", "fruit", "fruit", "fruit", "fruit", "fruit"),
    Vegetable = c("lemon", "lemon", "sprout", "lemon", "lemon", "watermelon")
), class = "data.frame", row.names = c(NA, -6L))

Thank you

Thanks for the reproducible data! Let's say you have your initial dataset `df` and want a cutoff for the observations per `Vegetable`: `cutoff <- 3`. Then use `library(dplyr)` and simply do `df %>% group_by(Vegetable) %>% filter(n() > cutoff) %>% ungroup()`. The `ungroup()` at the end is optional, to avoid interfering with any later transformations you might want to do. Unfortunately, you weren't very clear when you said _">3 rows of one observation"_. I **assume** you meant `Vegetable`s that occur more than 3 times, but _"observation"_ could really mean anything... — Greg, Jun 17 '21 at 15:54
What do you mean by n > 3 here? Do you mean one vegetable should have more than 3 rows? If yes, group by this column and filter using `n() > 3`. — AnilGoyal, Jun 17 '21 at 15:54
@greg thank you, that worked perfectly! Could you paste that as an answer so I can tick it off as answered? — Gabriella, Jun 17 '21 at 15:59
Hi Gabriella, and thank you for being so considerate about marking it as an answer! However, I am hesitant to formally submit it as an answer, as I suspect this question might have already been answered elsewhere on Stack Overflow, and it wouldn't be good form for me to farm reputation by answering a duplicate. If I find no duplicates on this site, I'll circle back to you. :) — Greg, Jun 17 '21 at 16:01
Okay, so I'm afraid this is all essentially a duplicate of [this answer](https://stackoverflow.com/a/40091131) to [this question](https://stackoverflow.com/q/20204257). Ironically, even the _cutoff itself_ is the same (`3`). The only difference is that you want to **include** exactly those observations above the cutoff (`> 3`), whereas the original question wanted to **exclude** them (`<= 3`). It's worth checking out the other answers, which include [alternatives](https://stackoverflow.com/a/20204630) in `base` R and are thus portable to other systems (which might not have `dplyr` installed). — Greg, Jun 17 '21 at 16:19
Happy to help! Just a heads up: I'm going to tag this post as a duplicate, to help future users searching for "one answer to rule them all". — Greg, Jun 17 '21 at 16:25
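
For reference, the approach discussed in the comments can be sketched as below. This is a minimal sketch, not the code from the linked answers. It assumes the intended meaning is "keep rows whose `Vegetable` value occurs more than 3 times". The names `group_size`, `kept_base`, and `kept_dplyr` are made up for illustration. The base R part uses `ave()` and needs no packages; the `dplyr` part is the pipeline from the first comment and only runs if `dplyr` is installed.

```r
# Example data, rebuilt from the structure() output in the question.
df <- data.frame(
  ID = c("aaa", "bbb", "ccc", "ddd", "eee", "fff"),
  Food = rep("fruit", 6),
  Vegetable = c("lemon", "lemon", "sprout", "lemon", "lemon", "watermelon"),
  stringsAsFactors = FALSE
)

# Keep only groups with strictly more than `cutoff` rows.
cutoff <- 3

# Base R: ave() returns, for every row, the size of that row's Vegetable group.
group_size <- ave(seq_along(df$Vegetable), df$Vegetable, FUN = length)
kept_base <- df[group_size > cutoff, ]

# dplyr: the pipeline from the comment, run only when dplyr is available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  kept_dplyr <- df %>%
    group_by(Vegetable) %>%
    filter(n() > cutoff) %>%
    ungroup()
}

print(kept_base)
```

With the sample data, only `lemon` has more than 3 rows, so both versions should keep the rows with IDs `aaa`, `bbb`, `ddd`, and `eee`. Note that the base R result keeps the original row names (1, 2, 4, 5), while the `dplyr` result is a tibble with renumbered rows.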
