I have a dataframe from which I'd like to extract a subset based on a group condition: for a given year x, if a species has only 1 individual, remove it from the df.

I am able to get a summary of this kind:


df %>%
  group_by(species,year) %>%
  summarise(n_inds = n()) %>%
  filter(n_inds > 1)

which gives this result:

# A tibble: 1,915 x 3
   espece                     year n_inds
   <fct>                     <dbl>  <int>
 1 Agelaioides badius         2003      5
 2 Agelaioides badius         2004      3
 3 Agelaioides badius         2005      4
 4 Amaurospiza moesta         2005      2
 5 Amaurospiza moesta         2014      2
 6 Amblyramphus holosericeus  2006      2
 7 Ammodramus humeralis       2010      4
 8 Ammodramus humeralis       2011      3
 9 Anabacerthia amaurotis     2001      3
10 Anabacerthia amaurotis     2004      5
# ... with 1,905 more rows

but it's not quite what I want. This df tells me, for example in the 1st row, that there are 5 individuals of Agelaioides badius in 2003. I want to keep those 5 rows in my original df, along with all the columns holding the different measurements for each corresponding bird (I'm working on birds).

If someone has a solution! :)

Thanks a lot

PS: the original df has 19501 observations of 9 variables.

Recology

1 Answer

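We don't need the summarise step. Instead, use the logical expression directly in filter. After group_by, n() is evaluated per group, so every row (with all its columns) is kept for groups that have more than one individual: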

library(dplyr)
df %>%
    group_by(species, year) %>%
    filter(n() > 1)
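
To make the behaviour concrete, here is a minimal sketch on a made-up data frame. The `toy` data, the `wing_mm` column and all values are invented for illustration; they only stand in for the real measurement columns.

```r
library(dplyr)

# Toy data (assumed values); `wing_mm` is a hypothetical measurement column
toy <- tibble(
  species = c("A", "A", "A", "B", "B", "C"),
  year    = c(2003, 2003, 2004, 2005, 2005, 2005),
  wing_mm = c(91, 93, 90, 70, 72, 55)
)

# Keep only rows whose species/year group has more than one individual
kept <- toy %>%
  group_by(species, year) %>%
  filter(n() > 1) %>%
  ungroup()

kept
```

Here A/2004 and C/2005 each have a single individual, so those rows are dropped, while the remaining rows keep every original column.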

If we need to create the 'n_inds', then use either add_count

df %>%
  add_count(species, year) %>%
  filter(n > 1)

Or create the column with mutate

df %>%
  group_by(species, year) %>%
  mutate(n_inds = n()) %>%   # per-group count, added to every row
  ungroup() %>%
  filter(n_inds > 1)

When we use summarise, it returns only the grouping columns and the summarised column, which is why the other measurement columns disappear from the result.
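
A small sketch of that difference, again on invented data (`toy` and `wing_mm` are assumptions, not part of the original df):

```r
library(dplyr)

# Toy data (assumed values); `wing_mm` is a hypothetical measurement column
toy <- tibble(
  species = c("A", "A", "B"),
  year    = c(2003, 2003, 2004),
  wing_mm = c(91, 93, 70)
)

# summarise(): one row per group, and `wing_mm` is dropped
summarised <- toy %>%
  group_by(species, year) %>%
  summarise(n_inds = n()) %>%
  ungroup()

# mutate(): same rows as the input, with `n_inds` added as an extra column
mutated <- toy %>%
  group_by(species, year) %>%
  mutate(n_inds = n()) %>%
  ungroup()

names(summarised)
names(mutated)
```

Depending on the dplyr version, the summarise call may print an informational message about the grouping of its output; it does not affect the result here.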

akrun
  • I then have a subsequent question: how would you, from this new df, erase all the rows containing the species that appear only one year (and therefore are not seen ever again), no matter this time the number of individuals? For example, I keep "species 1" seen on year x and y but not "species 2" only seen on year z. – Recology Dec 19 '20 at 21:48
  • @Recology Perhaps you want `df %>% group_by(species) %>% filter(n_distinct(year) > 1)` – akrun Dec 19 '20 at 21:50
  • It seems to work fine! Would you know a simple way to double check if there are no mistakes in the df, regarding what my conditions are? :) – Recology Dec 19 '20 at 22:16
  • @Recology when you say mistakes, are you checking whether all 'species' after the `filter` have only unique 'year'? – akrun Dec 19 '20 at 22:19
  • I would like to be 100% sure that every species could be found always on multiple years (>=2 different years). Even if looking at the df it seems to be the case, is there a way to prove it? – Recology Dec 19 '20 at 22:22
  • @Recology Once you subset the data you can always check with `df1 <- df %>% group_by(species) %>% filter(n_distinct(year) > 1) %>% ungroup; table(unique(df1[c('species', 'year')]))` or use a logical condition i.e. `df %>% distinct(species, year) %>% count(species, year) %>% transmute(flag = n > 1) %>% pull(flag) %>% all` – akrun Dec 19 '20 at 22:24
  • @Recology So either of those codes should work. In the second case, by using `all`, it should return a single TRUE if all have more than one year per species – akrun Dec 19 '20 at 22:27
  • ```> df %>% distinct(espece, year) %>% count(espece, year) %>% transmute(flag = n > 1) %>% pull(flag) %>% all [1]FALSE ``` this is what I got, so it's not the expected TRUE? – Recology Dec 19 '20 at 22:34
  • @Recology I meant after the `filter`. sorry, I should have added `df1` instead of `df` i.e. `df1%>% distinct(species, year) %>% count(species, year) %>% transmute(flag = n > 1) %>% pull(flag) %>% all` – akrun Dec 19 '20 at 22:36
  • sorry I don't get it this time :/ – Recology Dec 19 '20 at 22:40
  • @Recology I meant `df1 <- df %>% group_by(species) %>% filter(n_distinct(year) > 1) %>% ungroup` as the filtered data. Then you are doing the check on 'df1' which is the filtered output. – akrun Dec 19 '20 at 22:41
  • oh okay it's clearer thanks a lot, but I still got FALSE as an answer :/ – Recology Dec 19 '20 at 22:45
  • this is what I have done: ```df1 = df %>% group_by(espece) %>% filter(n_distinct(year) > 1) %>% ungroup df1 %>% distinct(espece, year) %>% count(espece, year) %>% transmute(flag = n > 1) %>% pull(flag) %>% all``` – Recology Dec 19 '20 at 22:46
  • @Recology Do you have `NA` elements? If so, use `df1 <- df %>% group_by(species) %>% filter(n_distinct(year, na.rm = TRUE) > 1) %>% ungroup` and use `df1%>% distinct(species, year) %>% na.omit %>% count(species, year) %>% transmute(flag = n > 1) %>% pull(flag) %>% all` – akrun Dec 19 '20 at 22:47
  • I don't have any NA: ```sum(is.na(df)) [1] 0``` – Recology Dec 19 '20 at 22:50
  • @Recology Can you check the output of `df1%>% distinct(species, year) %>% na.omit %>% count(species, year)` – akrun Dec 19 '20 at 22:51
  • ```A tibble: 1,865 x 3 espece year n 1 Agelaioides badius 2003 1 2 Agelaioides badius 2004 1 3 Agelaioides badius 2005 1 4 Amaurospiza moesta 2005 1 5 Amaurospiza moesta 2014 1 6 Ammodramus humeralis 2010 1 7 Ammodramus humeralis 2011 1 8 Anabacerthia amaurotis 2001 1 9 Anabacerthia amaurotis 2004 1 10 Anabacerthia amaurotis 2012 1 # ... with 1,855 more rows``` – Recology Dec 19 '20 at 22:53
  • @Recology Sorry, I meant `df1 %>% distinct(species, year) %>% count(species) %>% mutate(flag = n > 1) %>% pull(flag) %>% all` – akrun Dec 19 '20 at 22:56
  • The reason is that when we do the `count` on both 'species', 'year', it will return only a single 1. Instead, on the subset, it should be only for `species` count. If there are more than one unique 'year', it will always be greater than 1 – akrun Dec 19 '20 at 22:57
  • Now I have this answer: `df1 %>% distinct(espece, year) %>% count(espece) %>% mutate(flag = n > 1) %>% pull(flag) %>% all` gives `[1] TRUE`. Is this what we are looking for? – Recology Dec 19 '20 at 22:59
  • @Recology yes, that is the answer. sorry for the previous code. I should have tested it – akrun Dec 19 '20 at 22:59
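
For reference, the filter and the check that the comments converge on can be put together as a sketch. The `toy` data and the `n_years` column name are made up for illustration; the key point is that the count is taken per species only, after reducing to distinct species/year pairs.

```r
library(dplyr)

# Toy data (assumed values): A is seen in two years, B and C in one year each
toy <- tibble(
  species = c("A", "A", "A", "B", "B", "C"),
  year    = c(2003, 2003, 2004, 2005, 2005, 2005)
)

# Keep only species recorded in at least two distinct years
df1 <- toy %>%
  group_by(species) %>%
  filter(n_distinct(year) > 1) %>%
  ungroup()

# Check: number of distinct years per species in the filtered data
years_per_species <- df1 %>%
  distinct(species, year) %>%
  count(species, name = "n_years")

# TRUE only if every remaining species has more than one year
all_multi_year <- all(years_per_species$n_years > 1)
all_multi_year
```

Counting by both species and year, as in the earlier comments, gives 1 for every row after `distinct`, which is why that version returned FALSE.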