0

I have a data frame with many columns and rows with many ids, the following data only shows 2 ids.

id  treatment  
1   A          
1   B          
1   C          
1   D          
1   E          
2   A          
2   B         
2   C  

I want to get the subgroup of id with the five treatments (A,B,C,D,E). So the output table would look like this:

id  treatment
1   A          
1   B        
1   C           
1   D           
1   E   

Thank you very much.

tmfmnk
  • 38,881
  • 4
  • 47
  • 67
Bio_farmer
  • 149
  • 1
  • 7
  • 1
    Just to clarify, you only want `id`s that have all 5 subgroups, correct ? – steveb May 19 '19 at 14:43
  • If this is just about subsetting a dataframe then it is a [duplicate](https://stackoverflow.com/questions/3445590/extract-a-subset-of-a-dataframe-based-on-a-condition-involving-a-field). –  May 19 '19 at 14:49
  • 1
    @gersht I think there is a difference (which I am trying to clarify) where the desired results are filtered by only `ids that have a complete set of treatments. – steveb May 19 '19 at 15:02
  • 1
    One clarification that would be good to add to your question is how you want duplicates handled ? Should they be collapsed into unique items or should duplicate items be kept ? – steveb May 19 '19 at 16:11

6 Answers6

1

To return a subset of ID groups that contain all treatments do something like the following:

install.packages("dplyr")
library(dplyr)

treatments <- c("A", "B", "C", "D", "E")

df %>% group_by(id) %>% filter(all(treatments %in% treatment))

#### OUTPUT ####

# A tibble: 5 x 2
# Groups:   id [1]
     id treatment
  <int> <fct>    
1     1 A        
2     1 B        
3     1 C        
4     1 D        
5     1 E        

The main advantage here is that it deals with repeated conditions correctly. That is to say, in the off chance that you have the identical condition within an ID group, e.g.:

# A tibble: 11 x 2
      id treatment
   <dbl> <chr>    
 1     1 A        
 2     1 A        
 3     1 B        
 4     1 C        
 5     1 D        
 6     1 E        
 7     2 A        
 8     2 A        
 9     2 B        
10     2 B        
11     2 C        

The above code will return all observations of any groups containing every condition:

# A tibble: 6 x 2
# Groups:   id [1]
     id treatment
  <dbl> <chr>    
1     1 A        
2     1 A        
3     1 B        
4     1 C        
5     1 D        
6     1 E        
  • 1
    You can shorten it by changing last statement to `filter(all(LETTERS[1:5] %in% treatment))` – utubun May 19 '19 at 15:47
  • 1
    @utubun, I actually did that orginally, but I changed it to keep things as straightforward as possible, and because the OP's conditions might have names that are not just capital letters. –  May 19 '19 at 15:54
  • 1
    Agreed. It's possible to imagine it as `filter(all(unique(treatment) %in% treatment))` in order to stay within original table. But if there are no `id` with complete range of treatments that will fail. So your solution is probably the most reliable one. – utubun May 19 '19 at 16:01
  • 1
    @gersht In your solution, if the OP desires unique rows, you could pipe into `unique` as follows `df %>% group_by(id) %>% filter(all(treatments %in% treatment)) %>% unique()` – steveb May 19 '19 at 16:24
  • Or, supposing that there are only five possible treatments in a data, we can reduce it back to the solution proposed by @Dij: `group_by(unique(df), id) %>% filter(n() == 5)` – utubun May 19 '19 at 16:38
  • Thank you all very much. – Bio_farmer May 19 '19 at 22:49
0

Here's a super concise way to do it:

 head(my.data)
  id treatment
1  1         A
2  1         B
3  1         C
4  1         D
5  1         E
6  2         A

group_by(my.data, id) %>% filter(n_distinct(treatment) == 5)
# A tibble: 5 x 2
# Groups:   id [1]
     id treatment     
  <int> <chr>     
1     1 A            
2     1 B            
3     1 C            
4     1 D            
5     1 E   

Explanation

Internally, n_distinct(treatment) tallies up instances of the unique values in each id category, since you grouped by that column. Then filter retains the ids that occurred 5 times.

Dij
  • 1,318
  • 1
  • 7
  • 13
  • Yes, you are right if there are only two ids. However, there are many ids, how can I solve my problem? – Bio_farmer May 19 '19 at 14:43
  • Sorry, I don't understand the question..which IDs do you want in the output? – Dij May 19 '19 at 14:44
  • I want to select all the ids with all the treatments (A, B, C, D, E). Thank you. – Bio_farmer May 19 '19 at 14:45
  • 1
    Some ids are only with part of the treatments, which are not what I want to select. – Bio_farmer May 19 '19 at 14:46
  • Are you saying you want the subset of IDs that have all 5 treatments, only? – Dij May 19 '19 at 15:00
  • This doesn't work if conditions are repeated within IDs. –  May 19 '19 at 15:17
  • 1
    @gersht That's true, I assumed they wouldn't be if they were "treatments". But you are right, not sure if that's an issue for OP's data or not – Dij May 19 '19 at 15:18
  • 1
    @Dij you can improve it a bit by changing last line into `filter(n_distinct(treatment) == 5)` – utubun May 19 '19 at 15:51
  • Good point @Dij. My background is in psychology, so I can imagine treatments being repeated over time, i.e. `time=1, id=1, treat=A` and `time=2, id=1, treat=A`. It depends on the OP's needs. I think your answer is fine for this case, but one of the others might be more appropriate for the more general case. –  May 19 '19 at 16:22
  • @Dij There is 14 characters shorter version of your code:) `group_by(unique(dat), id) %>% filter(n() == 5)` – utubun May 19 '19 at 16:44
0

One option is the following, using dplyr.

EDIT

This is a more concise solution, and the original solution is below. This shorter solution also removes duplicate rows. The num.treatments is hard coded to 5 but can be set any way you need to.

library(dplyr)

# load your data
treatment.df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), treatment = c("A", "B", "C", "D", "E", "A", "B", "C")), class = "data.frame", row.names = c(NA, -8L))
num.treatments <- 5
unique(treatment.df[c("id", "treatment")]) %>%
  group_by(id) %>%
  filter(n() == num.treatments)

Original Solution

library(dplyr)

# load your data
treatment.df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), treatment = c("A", "B", "C", "D", "E", "A", "B", "C")), class = "data.frame", row.names = c(NA, -8L))
treatment.df
##   id treatment
## 1  1         A
## 2  1         B
## 3  1         C
## 4  1         D
## 5  1         E
## 6  2         A
## 7  2         B
## 8  2         C

## Get IDs with a complete set of treatments
ids.with.all.treatments <-
  treatment.df %>%
  group_by(id, treatment) %>%
  summarise() %>%          # Get unique set if id/treatment pairs in cases of duplicates.
  summarise(cnt = n()) %>% # Summarise by 'id' to the count per id.
  filter(cnt == 5)         # Get items with the expected number of treatments


treatment.df %>%
  filter(id %in% ids.with.all.treatments$id)
##   id treatment
## 1  1         A
## 2  1         B
## 3  1         C
## 4  1         D
## 5  1         E
steveb
  • 5,382
  • 2
  • 27
  • 36
0

Also a dplyr possibility:

df %>%
 group_by(id) %>%
 filter(grepl("A,B,C,D,E", paste(treatment, collapse = ","), fixed = TRUE))

     id treatment
  <int> <chr>    
1     1 A        
2     1 B        
3     1 C        
4     1 D        
5     1 E 

Here it takes into account also the ordering, meaning it filters out cases where treatment is not going A,B,C,D,E.

If the order is not important, then you can arrange the "treatment" column first:

df %>%
 group_by(id) %>%
 arrange(treatment, .by_group = TRUE) %>%
 filter(grepl("A,B,C,D,E", paste(treatment, collapse = ","), fixed = TRUE))

Considering a scenario shown by @gersht, it can be modified to:

df %>%
 group_by(id) %>%
 filter(grepl("A,B,C,D,E", paste(unique(treatment), collapse = ","), fixed = TRUE))

Or:

df %>%
 group_by(id) %>%
 arrange(treatment, .by_group = TRUE) %>%
 filter(grepl("A,B,C,D,E", paste(unique(treatment), collapse = ","), fixed = TRUE))
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
0

I thought it would be interesting to give a base R solution,

DF = data.frame(id = c(rep(1,5),rep(2,3)), treatment = c('A','C','B','D','E','A','B','C'))

Note that I permuted the order slightly in treatment to account for its possibility.

do.call(rbind,
  lapply(split(DF,DF$id),
    function(X){
      if(identical(sort(unique(X$treatment)),c('A','B','C','D','E'))) X
      else NULL}))

I allow for the possibility that an id can have multiples of the same treatment since it is unclear what the rest of the data.frame looks like and that the treatments for an id group might not be sorted.

abcxyz
  • 81
  • 3
0

Here's a way in base R using subset and ave (borrowing @steveb's data):

treatments <- LETTERS[1:5]
subset(treatment.df, as.logical(ave(treatment, id, FUN = function(x) all(treatments %in% x))))
#   id treatment
# 1  1         A
# 2  1         B
# 3  1         C
# 4  1         D
# 5  1         E
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167