
Is there a way to subset my data frame if there are >3 rows of one observation? E.g., my data looks like:

ID    Food        Vegetable 
aaa   fruit       lemon 
bbb   fruit       lemon
ccc   fruit       sprout
ddd   fruit       lemon
eee   fruit       lemon
fff   fruit       watermelon

and I'd like it to look like:

ID    Food        Vegetable 
aaa   fruit       lemon 
bbb   fruit       lemon
ddd   fruit       lemon
eee   fruit       lemon

   

Structure:

structure(list(ID = c("aaa", "bbb", "ccc", "ddd", "eee", "fff"
    ), Food = c("fruit", "fruit", "fruit", "fruit", "fruit", "fruit"), Vegetable = c("lemon", 
    "lemon", "sprout", "lemon", "lemon", "watermelon")), class = "data.frame", row.names = c(NA, 
    -6L))

Thank you

Gabriella
  • Thanks for the reproducible data! Let's say you have your initial dataset `df` and want a cutoff for the observations per `Vegetable`: `cutoff <- 3`. Then use `library(dplyr)` and simply do `df %>% group_by(Vegetable) %>% filter(n() > cutoff) %>% ungroup()`. The `ungroup()` at the end is optional, to avoid interfering with any later transformations you might want to do. Unfortunately, you weren't very clear when you said _">3 rows of one observation"_. I **assume** you meant `Vegetable`s that occur more than 3 times, but _"observation"_ could really mean anything... – Greg Jun 17 '21 at 15:54
  • What do you mean by `n > 3` here? Do you mean one vegetable should have at least 3 rows? If yes, group by this column and filter using `n() > 3`. – AnilGoyal Jun 17 '21 at 15:54
  • @Greg thank you, that worked perfectly! Could you post that as an answer so I can mark it as accepted? – Gabriella Jun 17 '21 at 15:59
  • Hi Gabriella, and thank you for being so considerate about marking it as an answer! However, I am hesitant to formally submit it as an answer, as I suspect this question might have already been answered elsewhere on Stack Overflow, and it wouldn't be good form for me to farm reputation by answering a duplicate. If I find no duplicates on this site, I'll circle back to you. :) – Greg Jun 17 '21 at 16:01
  • Okay, so I'm afraid this is all essentially a duplicate of [this answer](https://stackoverflow.com/a/40091131) to [this question](https://stackoverflow.com/q/20204257). Ironically, even the _cutoff itself_ is the same (`3`). The only difference is that you want to **include** exactly those observations above the cutoff (`> 3`), whereas the original question wanted to **exclude** them (`<= 3`). It's worth checking out the other answers, which include [alternatives](https://stackoverflow.com/a/20204630) in `base` R and are thus portable to other systems (which might not have `dplyr` installed). – Greg Jun 17 '21 at 16:19
  • Happy to help! Just a heads up: I'm going to tag this post as a duplicate, to help future users searching for "one answer to rule them all". – Greg Jun 17 '21 at 16:25
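
For future readers, here is a minimal sketch of the approach from the comments, assuming "observation" means a value of `Vegetable` and that `df` is the data frame from the question. The base R line using `ave()` is my own illustration of a `dplyr`-free alternative, not necessarily the one in the linked answers, and the names `result_base` and `result_dplyr` are just placeholders. The `dplyr` part only runs if the package is installed.

```r
# Sample data, copied from the question
df <- structure(list(
  ID = c("aaa", "bbb", "ccc", "ddd", "eee", "fff"),
  Food = c("fruit", "fruit", "fruit", "fruit", "fruit", "fruit"),
  Vegetable = c("lemon", "lemon", "sprout", "lemon", "lemon", "watermelon")
), class = "data.frame", row.names = c(NA, -6L))

# Keep only vegetables that occur MORE than `cutoff` times
cutoff <- 3

# Base R (illustrative): ave() returns, for every row, the size of that row's group
group_size <- ave(seq_along(df$Vegetable), df$Vegetable, FUN = length)
result_base <- df[group_size > cutoff, ]

# dplyr version from the comments (skipped if dplyr is not installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  result_dplyr <- df %>%
    group_by(Vegetable) %>%
    filter(n() > cutoff) %>%
    ungroup()  # optional: drops the grouping so later steps are unaffected
}
```

With the sample data, both versions should keep only the four `lemon` rows (`aaa`, `bbb`, `ddd`, `eee`). Note that the base R result keeps the original row names, while the `dplyr` result is a tibble.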
