Consider the following dataset. Each group contains either one or two individuals, and an individual may have several rows.

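# Vectors reconstructed from the printed data below so the example is reproducible
group        <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
individualID <- c(1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7)
X            <- c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1)
df1 <- data.frame(group, individualID, X)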
> df1
   group individualID X     
1      1            1  0 
2      1            1  1 
3      1            2  1 
4      1            2  1 
5      2            3  1 
6      2            3  1 
7      3            5  1 
8      3            5  1 
9      3            6  1 
10     3            6  1 
11     4            7  0 
12     4            7  1 

From the above, groups 1 and 3 have 2 individuals each, whereas groups 2 and 4 have 1 individual each.

> aggregate(data = df1,  individualID ~ group, function(x) length(unique(x)))
  group individualID
1     1            2
2     2            1
3     3            2
4     4            1

How can I subset the data to keep only the groups that have more than one individual, i.e. omit groups with a single individual?

I should end up with only group 1 and group 3.

  • Use `merge` to join your `df1` with the result from `aggregate`; then use `subset`. Also see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 for making a reproducible example, then update your question if the first doesn't work for you. – MrGumble Oct 12 '21 at 13:08
  • `df %>% group_by(group) %>% filter(n_distinct(individualID) > 1)` using `dplyr`. – Ronak Shah Oct 12 '21 at 13:16

1 Answer

You could build a lookup table of the groups that have more than one unique `individualID` (similar to what you did with `aggregate`), then filter `df1` against it:

library(dplyr)

lookup <- df1 %>% 
          group_by(group) %>% 
          summarise(count = n_distinct(individualID)) %>%
          filter(count > 1)

df1 %>% filter(group %in% unique(lookup$group))
  group individualID X
1     1            1 0
2     1            1 1
3     1            2 1
4     1            2 1
5     3            5 1
6     3            5 1
7     3            6 1
8     3            6 1
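
The lookup step can also be folded into a single grouped filter, as suggested in the comments. The following is a minimal sketch, assuming `dplyr` is installed and that `df1` is rebuilt from the table printed in the question; `multi` is a name chosen here purely for illustration:

```r
library(dplyr)

# df1 rebuilt from the data printed in the question
df1 <- data.frame(
  group        = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  individualID = c(1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7),
  X            = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1)
)

# Within each group, keep the rows only if the group
# contains more than one distinct individualID.
multi <- df1 %>%
  group_by(group) %>%
  filter(n_distinct(individualID) > 1) %>%
  ungroup()   # drop the grouping so later steps see a plain table

unique(multi$group)
```

Because `filter()` is evaluated per group here, `n_distinct(individualID) > 1` yields one TRUE or FALSE for the whole group, so entire groups are kept or dropped together.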

Or, as @MrGumble suggests above, you could merge `df1` with `lookup` instead; note that this also carries the `count` column into the result:

merge(df1, lookup)
  group individualID X count
1     1            1 0     2
2     1            1 1     2
3     1            2 1     2
4     1            2 1     2
5     3            6 1     2
6     3            6 1     2
7     3            5 1     2
8     3            5 1     2
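
If you would rather stay in base R, the same merge idea can be sketched with `aggregate`, `merge`, and `subset`. This is only an illustration: `df1` is rebuilt from the table in the question, and `counts`, `n_individuals`, `merged`, and `result` are hypothetical names introduced here.

```r
# df1 rebuilt from the data printed in the question
df1 <- data.frame(
  group        = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  individualID = c(1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7),
  X            = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1)
)

# Count distinct individuals per group (same call as in the question).
counts <- aggregate(individualID ~ group, data = df1,
                    FUN = function(x) length(unique(x)))

# Rename the count column so it does not clash with df1$individualID
# when merging (n_individuals is a hypothetical name).
names(counts)[2] <- "n_individuals"

# Attach the per-group count to every row, then keep only groups
# with more than one individual and drop the helper column.
merged <- merge(df1, counts, by = "group")
result <- subset(merged, n_individuals > 1, select = -n_individuals)

result
```

Renaming the count column before merging matters: without it, `merge` would see `individualID` in both tables and, with the default `by`, try to join on it as well.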
NovaEthos