how to determine the number of unique values based on multiple criteria dplyr

Question

I've got a df that looks like:

df(site=c(A,B,C,D,E), species=c(1,2,3,4), Year=c(1980:2010).

I would like to calculate the number of different years that each species appear in each site, creating a new column called nYear, I've tried filtering by group and using mutate combined with ndistinct values but it is not quite working.

Here is part of the code I have been using:

Df1 <- Df %>%
  filter(Year>1985)%>%
  mutate(nYear = n_distinct(Year[Year %in% site]))%>%
  group_by(Species,Site, Year) %>% 
  arrange(Species, .by_group=TRUE) 
  ungroup()

score 2 · Accepted Answer · answered Aug 03 '22 at 12:55

The approach is good, a few things to correct.

First, let's make some reproducible data (your code gave errors).

df <- data.frame("site"=LETTERS[1:5], "species"=1:5, "Year"=1981:2010)

You should have used summarise instead of mutate when you're looking to summarise values across groups. It will give you a shortened tibble as an output, with only the groups and the summary figures present (fewer columns and rows).

mutate on the other hand aims to modify an existing tibble, keeping all rows and columns by default.

The order of your functions in the chains also needs to change.

df %>%
  filter(Year>1985) %>%
  group_by(species,site) %>% 
  summarise(nYear = length(unique(Year))) %>% # instead of mutate
  arrange(species, .by_group=TRUE) %>% 
ungroup()

First, group_by(species,site), not year, then summarise and arrange.

# A tibble: 5 × 3
  species site  nYear
    <int> <chr> <int>
1       1 A         5
2       2 B         5
3       3 C         5
4       4 D         5
5       5 E         5

Thanks, this worked and provided the information I was looking to get, however, summarise, as you said, gives a shortened version of my original df, but I would like to keep all the colums in my dataset. I've tried using mutate as per my original question, is there any simple way to keep all columns? — RGR_288, Aug 04 '22 at 10:38
I would suggest to create a new question in order to detail which error you're getting using `mutate`. You can comment here the link to this new question — gaut, Aug 04 '22 at 10:40
https://stackoverflow.com/questions/73235302/cumulative-sum-of-unique-values-based-on-multiple-criteria — RGR_288, Aug 04 '22 at 11:39

score 1 · Answer 2 · answered Aug 03 '22 at 12:58

1

You can use distinct() on the filtered frame, and then count by your groups of interest:

distinct(Df %>% filter(Year>1985)) %>%
  count(Site, Species,name = "nYear")

answered Aug 03 '22 at 12:58

langtang

22,248
1
12
27

this works well indeed. Counting *unique* values even in the presence of other columns in the data. – gaut Aug 03 '22 at 13:03
my solution may be more concise and computationally faster, but you (+1) have provided a more thorough explanation to help the OP understand their approach – langtang Aug 03 '22 at 13:08
Thanks I tried this and also worked great! – RGR_288 Aug 03 '22 at 16:12

how to determine the number of unique values based on multiple criteria dplyr

2 Answers2