R: find most frequent combinations within same id

Question

I have a problem counting the number of drug combinations. My data is organized in two data frames. df1 contains the ID and the drugs found, like this:

ID | drug
-----------
1  | drug1
1  | drug2
1  | drug3
2  | drug3
2  | drug5
3  | drug1
3  | drug3
3  | drug4
3  | drug5

df2 lists all possible combinations of two different drugs, like this:

combi1 | combi2
-----------------
drug1  | drug2
drug1  | drug3
drug1  | drug4
drug2  | drug3
drug2  | drug4
drug2  | drug5

There are 7140 possible combinations in total. What I want is to find out how many IDs have the combination drug1-drug2, drug1-drug3, and so forth.

I have been trying a double for loop:

counter=0
for(com in 1:nrow(df2)){
 for(id in 1:unique(df1$ID)){
   if(df2$combi1[com] %in% df1$drug[id] & df2$combi2[com] %in% df1$drug[id])   {
  counter=counter+1
  }
}
df2$count=counter
counter=0
}

But it doesn't work, because it only checks one row at a time. I have also tried the solution in Find Most Frequent Combination within a Vector by Group, but without any luck.
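For reference, the loop above has three separate problems: `1:unique(df1$ID)` only uses the first unique ID, `df1$drug[id]` picks a single row instead of all drugs belonging to that ID, and `df2$count=counter` overwrites the whole column on every pass. Below is a minimal corrected sketch of the same double loop. It assumes the sample df1 and df2 from the question, rebuilt here by hand so the snippet runs on its own; with 7140 combinations and many IDs it will be slow, so treat it as an illustration rather than a recommended solution.

```r
# Sample data from the question (typed in by hand for illustration)
df1 <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  drug = c("drug1", "drug2", "drug3", "drug3", "drug5",
           "drug1", "drug3", "drug4", "drug5"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  combi1 = c("drug1", "drug1", "drug1", "drug2", "drug2", "drug2"),
  combi2 = c("drug2", "drug3", "drug4", "drug3", "drug4", "drug5"),
  stringsAsFactors = FALSE
)

df2$count <- 0L
for (com in seq_len(nrow(df2))) {
  counter <- 0L
  for (id in unique(df1$ID)) {
    # all drugs recorded for this ID, not just one row
    drugs_id <- df1$drug[df1$ID == id]
    if (df2$combi1[com] %in% drugs_id && df2$combi2[com] %in% drugs_id) {
      counter <- counter + 1L
    }
  }
  # store the count in the row of this combination only
  df2$count[com] <- counter
}
df2
```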

Furthermore, I need to do the same with combinations of three drugs.

EDIT: I would like the output to look like this in df2, so that I can see how many IDs have drug1 and drug2 as a combination. For example, only one ID had both drug1 and drug2, whereas two IDs had drug1 and drug3.

combi1 | combi2 | count
-----------------------
drug1  | drug2  |   1
drug1  | drug3  |   2
drug1  | drug4  |   1
drug2  | drug3  |   1
drug2  | drug4  |   0
drug2  | drug5  |   0

See [this similar post](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses); `cbind(df2, n = crossprod(table(df1))[as.matrix(df2)])` — alexis_laz, Nov 01 '16 at 13:55
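To unpack the one-liner in the comment above: `table()` builds an ID-by-drug incidence matrix, `crossprod()` turns it into a drug-by-drug co-occurrence matrix, and indexing that matrix with the two-column character matrix `as.matrix(df2)` looks up one count per pair. The sketch below assumes the sample df1 and df2 from the question (typed in by hand), adds `unique()` so that a drug listed twice for one ID is counted once, and names the new column `count`; it also assumes every drug in df2 occurs somewhere in df1, otherwise the lookup fails with a subscript error.

```r
# Sample data from the question (typed in by hand for illustration)
df1 <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  drug = c("drug1", "drug2", "drug3", "drug3", "drug5",
           "drug1", "drug3", "drug4", "drug5"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  combi1 = c("drug1", "drug1", "drug1", "drug2", "drug2", "drug2"),
  combi2 = c("drug2", "drug3", "drug4", "drug3", "drug4", "drug5"),
  stringsAsFactors = FALSE
)

# ID x drug incidence matrix: 1 if the ID has the drug, 0 otherwise
inc <- table(unique(df1[, c("ID", "drug")]))

# drug x drug matrix: entry [a, b] is the number of IDs that have both a and b
co <- crossprod(inc)

# look up each (combi1, combi2) pair by name
result <- cbind(df2, count = co[as.matrix(df2)])
result
```

For the sample data this gives 1 for drug1-drug4, because ID 3 has both drugs.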

sebastian-c · Accepted Answer · 2016-11-02T14:26:40.017

1

For this one, I reached for data.table, but you could use tidyr just as easily.

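library(data.table)
set.seed(213) # make the random example reproducible
# dummy data: 3 IDs with 3 randomly drawn drugs each
d <- data.table(ID = rep(1:3, each = 3), drug = paste0("drug", sample(1:5, 9, replace = TRUE)))

# return all n-element combinations of the distinct values in x, one per row
get_combs <- function(x, n = 2){
  uniq_x <- sort(unique(x)) # sorted, so drug1-drug2 and drug2-drug1 are the same pair
  if(length(uniq_x) < n){
    return(NULL) # fewer than n distinct drugs: no combinations for this ID
  } else {
    return(as.data.frame(t(combn(uniq_x, n)), stringsAsFactors = FALSE))
  }
}

# combinations per ID, then the number of IDs for each combination
combi <- d[, get_combs(drug), by = ID][order(V1, V2),]
combi[ , .N, by = .(V1, V2)]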

      V1    V2 N
1: drug1 drug2 2
2: drug1 drug4 2
3: drug2 drug4 2
4: drug3 drug5 1
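The `n` argument of `get_combs` is what extends this to three or more drugs. As a rough illustration, here is a base-R sketch of the same idea for triples, without data.table. It assumes the sample df1 from the question (typed in by hand) and uses `split()`, `lapply()` and `aggregate()` in place of the data.table grouping, so it is an alternative formulation, not the code from this answer.

```r
# Sample data from the question (typed in by hand for illustration)
df1 <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  drug = c("drug1", "drug2", "drug3", "drug3", "drug5",
           "drug1", "drug3", "drug4", "drug5"),
  stringsAsFactors = FALSE
)

# all n-element combinations of the distinct values in x, one per row
get_combs <- function(x, n = 2) {
  uniq_x <- sort(unique(x))
  if (length(uniq_x) < n) return(NULL)  # too few drugs for this ID
  as.data.frame(t(combn(uniq_x, n)), stringsAsFactors = FALSE)
}

# triples per ID; IDs with fewer than 3 drugs yield NULL and are dropped
per_id  <- lapply(split(df1$drug, df1$ID), get_combs, n = 3)
per_id  <- Filter(Negate(is.null), per_id)
triples <- do.call(rbind, per_id)

# count how many IDs share each triple
triples$N <- 1L
counts <- aggregate(N ~ V1 + V2 + V3, data = triples, FUN = sum)
counts
```

With data.table, the equivalent change to the code above would be passing `n = 3` to `get_combs` and grouping by `.(V1, V2, V3)`.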

edited Nov 02 '16 at 14:26

answered Nov 01 '16 at 11:13

sebastian-c


That is not really a combination of drugs. – reuss Nov 01 '16 at 12:07
Sorry, you're right - I misinterpreted your question. I'll give it another shot. – sebastian-c Nov 01 '16 at 13:03
That second output helped clarify the issue, thanks. – sebastian-c Nov 01 '16 at 13:40
It is strange. When I switch your code from the dummy set above to the real data, I get ``Error in `[.data.frame`(mydata, ,get_combs(drug), by = ID) : unused argument (by = ID)`` – reuss Nov 01 '16 at 14:13
@reuss: Make sure that `mydata` is a `data.table` (which is an enhanced data.frame). For instance by using `setDT(mydata)`. – Uwe Nov 01 '16 at 14:47
Ahh. That was the issue with mydata. It works now that I have used `setDT(mydata)`. The next issue is that I end up with both a drug1-drug2 combination (let's say n=2) and a drug2-drug1 combination (n=1), whereas I only want a single drug1-drug2 combination (n=3), maybe ordered alphabetically. – reuss Nov 01 '16 at 15:01
@reuss, in `get_combs`, try using `uniq_x <- sort(unique(x))` – sebastian-c Nov 01 '16 at 15:32
@sebastian-c It works perfectly! Now, what if I want to expand it to 3 drug combinations? Such as drug1-drug2-drug3. I tried to change from `combn(uniq_x,2)` to `combn(uniq_x,3)` and then adding a V3 in the two entries, where V1, V2 are. However, it returns an error from `get_combs`: Error in combn(uniq_x, 3) : n < m – reuss Nov 01 '16 at 16:33
The problem is illustrated in `combn(1:2, 3)`. What behaviour do you want when the function tries to take 2 drugs three at a time? I've modified my answer to return NULL in that case. – sebastian-c Nov 02 '16 at 14:23
@sebastian-c, with your new addition, it seems to work perfectly for 3-drug combinations, and also for 4- and 5-drug combinations. Thank you very much for all your help! – reuss Nov 03 '16 at 07:14

coffeinjunky · Answer 2 · 2016-11-01T12:16:18.877

0

It might be easier to reshape the data:

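library(reshape2)
set.seed(213) # set seed
df <- data.frame(ID = rep(1:3, each = 3), drug = paste0("drug", sample(1:5, 9, replace = TRUE))) # define data
# one row per ID, one column per drug; each cell counts how often the drug occurs for that ID
df <- dcast(df, ID ~ drug, value.var = "drug", fun.aggregate = length)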
df
  ID drug1 drug2 drug3 drug4 drug5
1  1     1     1     0     1     0
2  2     0     0     2     0     1
3  3     1     1     0     1     0

Now you have the combinations in one row per ID and you can use standard subsetting to find all IDs with certain combinations. Is this what you are looking for? If not, please add the desired output to your question.
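As a rough sketch of what that subsetting could look like: the snippet below types in the wide table printed above by hand (as `wide`, a name chosen here for illustration) so it runs without reshape2, counts the IDs that have both drug1 and drug2, and then repeats the check for every pair of drug columns.

```r
# Wide table as printed above, typed in by hand for illustration
wide <- data.frame(
  ID    = 1:3,
  drug1 = c(1, 0, 1),
  drug2 = c(1, 0, 1),
  drug3 = c(0, 2, 0),
  drug4 = c(1, 0, 1),
  drug5 = c(0, 1, 0)
)

# IDs that have both drug1 and drug2 (a cell > 0 means the drug occurs)
both <- wide$drug1 > 0 & wide$drug2 > 0
wide$ID[both]   # which IDs
sum(both)       # how many IDs

# the same check for every pair of drug columns
pairs  <- t(combn(names(wide)[-1], 2))
counts <- apply(pairs, 1, function(p) sum(wide[[p[1]]] > 0 & wide[[p[2]]] > 0))
pair_counts <- data.frame(combi1 = pairs[, 1], combi2 = pairs[, 2], count = counts)
pair_counts
```

Comparing with `> 0` rather than `== 1` matters here because a cell can hold a count above 1, as drug3 does for ID 2.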

edited Nov 01 '16 at 12:16

answered Nov 01 '16 at 12:10

coffeinjunky


No, not really. I want to know how many IDs have both drug1 and drug2. EDIT: you just added more to your answer after I wrote this – reuss Nov 01 '16 at 12:20
I have added an edit to my question in the bottom with the desired output – reuss Nov 01 '16 at 12:33
