How to conditionally count and record if a sample appears in rows of another dataset?

Question

I have a genetic dataset of IDs (dataset1) and a dataset of IDs which interact with each other (dataset2). I am trying to count IDs in dataset1 which appear in either of 2 interaction columns in dataset2 and also record which are the interacting/matching IDs in a 3rd column.

Dataset1:

ID
1
2
3

Dataset2:

Interactor1    Interactor2
1                  5
2                  3
1                  10

Output:

ID   InteractionCount    Interactors
1            2               5, 10
2            1                3
3            1                2

So the output contains all IDs of dataset1, with a count of how many times each ID appears in either column 1 or column 2 of dataset2. If an ID does appear, the output also stores which IDs in dataset2 it interacts with.

I have a biology background, so I have been guessing at how to approach this. So far I've used merge() and setDT(mergeddata)[, .N, by=ID] to try to count the dataset1 IDs which appear in dataset2, but I'm not sure this is the right approach if I also want to create the column storing the interacting IDs. Any help with functions that can store the matched IDs in a 3rd column would be appreciated.

Input data:

dput(dataset1)
structure(list(ID = 1:3), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

dput(dataset2)
structure(list(Interactor1 = c(1L, 2L, 1L), Interactor2 = c(5L, 
3L, 10L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))

Not quite clear on your Output table. ID=1 appears twice in Dataset2, both in column 1 and they interact with 5 and 10 in column 2. ID=2 appears once, in column 1, it interacts with 3 in column 2. So far, so good. ID=3 appears once in Dataset2, in column 2. I'd say it interacts with 2 from column 1. The answer by @Limey does this. The final row of his output is 3/1/2. But your final row is 3/1/3 implying you want to know that 3 in column 2 interacts with itself. Can you clarify? — DaveTurek, May 19 '20 at 22:05
Thank you for this, it was a typo, the answer Limey gives is what I need, thanks for highlighting. — DN1, May 20 '20 at 10:47

score 2 · Accepted Answer · answered May 19 '20 at 22:32

Here is an option using data.table:

x <- names(DT2)
cols <- c("InteractionCount", "Interactors")

#ensure that the pairs are ordered for each row and there are no duplicated pairs
DT2 <- setkeyv(unique(DT2[,(x) := .(pmin(i1, i2), pmax(i1, i2))]), x)

#for each ID find the neighbours linked to it
neighbours <- rbindlist(list(DT2[, .(.N, toString(i2)), i1],
    DT2[, .(.N, toString(i1)), i2]), use.names=FALSE)
setnames(neighbours, names(neighbours), c("ID", cols))

#update DT1 (dataset1 in the question) using the above data
DT1[, (cols) := neighbours[DT1, on=.(ID), mget(cols)]]

output for DT1:

   ID InteractionCount Interactors
1:  1                2       5, 10
2:  2                1           3
3:  3                1           2

data:

library(data.table)
DT1 <- structure(list(ID = 1:3), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
DT2 <- structure(list(i1 = c(1L, 2L, 1L), i2 = c(5L, 3L, 10L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
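To illustrate why the pmin/pmax step in the code above matters, here is a minimal sketch on hypothetical data (the table `DT` and its values are made up for this example). It assumes that a pair recorded as (1, 5) and again as (5, 1) should count as a single interaction:

```r
library(data.table)

# Hypothetical input: the pair (1, 5) is recorded twice, once in each direction
DT <- data.table(i1 = c(1L, 5L, 2L), i2 = c(5L, 1L, 3L))

# Order each pair so the smaller ID is always in i1 and the larger in i2
DT[, c("i1", "i2") := list(pmin(i1, i2), pmax(i1, i2))]

# After ordering, the two copies of the pair are identical rows and collapse
deduped <- unique(DT)
deduped
```

Without the ordering step, unique() would treat (1, 5) and (5, 1) as different rows and ID 1 would be counted twice for the same partner.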

Should `dataset1` be `DT1`? — DaveTurek, May 19 '20 at 23:36
Yeah I am too lazy to type the longer names — chinsoon12, May 19 '20 at 23:59

DaveTurek · Answer 2 · 2020-05-20T11:22:33.080

Another data.table answer.

library(data.table)
d1 <- data.table(ID=1:3)
d2 <- data.table(I1=c(1,2,1),I2=c(5,3,10))

# first stack I1 on I2 and vice versa
Output <- d2[,.(ID=c(I1,I2),x=c(I2,I1))]
Output
#    ID  x
# 1:  1  5
# 2:  2  3
# 3:  1 10
# 4:  5  1
# 5:  3  2
# 6: 10  1

# then collect the desired columns
Output <- Output[ID %in% d1$ID][
  ,.(InteractionCount=.N,
    Interactors = list(x)),
  by=ID]
Output
#    ID InteractionCount Interactors
# 1:  1                2        5,10
# 2:  2                1           3
# 3:  3                1           2
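The question asks for every ID of dataset1 to appear in the output and for the interactors to be stored as text such as "5, 10". A possible variation on the approach above, sketched under two assumptions: IDs with no interactions should get a count of 0, and a comma-separated string is acceptable instead of a list column. The extra ID 4 is hypothetical and is only there to show the no-match case:

```r
library(data.table)

# Hypothetical data: ID 4 does not appear anywhere in d2
d1 <- data.table(ID = 1:4)
d2 <- data.table(I1 = c(1L, 2L, 1L), I2 = c(5L, 3L, 10L))

# Stack both directions so every ID shows up in the ID column
pairs <- d2[, .(ID = c(I1, I2), x = c(I2, I1))]

# One row per ID: the count and a comma-separated string of partners
summ <- pairs[, .(InteractionCount = .N, Interactors = toString(x)), by = ID]

# Join onto d1 so that every ID of d1 is kept; unmatched IDs get NA
result <- summ[d1, on = "ID"]

# Assumption: an ID with no match should show a count of 0 rather than NA
result[is.na(InteractionCount), InteractionCount := 0L]
result
```

The Interactors entry for an unmatched ID stays NA here; whether an empty string would be preferable depends on how the column is used afterwards.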

EDIT: If the IDs are not numeric, you can set a key on d1:

library(data.table)
d1 <- data.table(ID=c("1","2","3A"))
setkey(d1,ID)
d2 <- data.table(I1=c("1","2","1"),I2=c("5","3A","10"))

Output <- d2[,.(ID=c(I1,I2),x=c(I2,I1))]
Output
#    ID  x
# 1:  1  5
# 2:  2 3A
# 3:  1 10
# 4:  5  1
# 5: 3A  2
# 6: 10  1

Output <- Output[ID %in% unlist(d1[(ID)])][
  ,.(InteractionCount=.N,
    Interactors = list(x)),
  by=ID]
Output
#    ID InteractionCount Interactors
# 1:  1                2        5,10
# 2:  2                1          3A
# 3: 3A                1           2

score 1 · Answer 3 · answered May 19 '20 at 17:04

Here's a solution based on the tidyverse package.

library(tidyverse)

d1 <- tibble(ID=1:3)
d2 <- tibble(Interactor1=c(1, 2, 1), Interactor2=c(5, 3, 10))

I think some of your difficulty is caused by the fact that your data is not tidy. You can read about what this means on the tidyverse homepage. Let's make d2 tidy:

d2narrow <- d2 %>% gather(key="Where", value="ID", Interactor1, Interactor2)
d2narrow

which gives:

# A tibble: 6 x 2
  Where          ID
  <chr>       <dbl>
1 Interactor1     1
2 Interactor1     2
3 Interactor1     1
4 Interactor2     5
5 Interactor2     3
6 Interactor2    10
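In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). A rough equivalent of the reshaping step above is sketched below; the name `d2narrow_alt` is made up for this example. The rows come out in a different order (row by row rather than column by column), which does not affect the counts:

```r
library(tibble)
library(tidyr)

d2 <- tibble(Interactor1 = c(1, 2, 1), Interactor2 = c(5, 3, 10))

# Reshape to one row per (column name, ID) observation
d2narrow_alt <- pivot_longer(
  d2,
  cols = c(Interactor1, Interactor2),
  names_to = "Where",
  values_to = "ID"
)
d2narrow_alt
```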

Now getting the InteractionCounts is easy:

counts <- d2narrow %>% group_by(ID) %>% summarise(InteractionCount=n())
counts

# A tibble: 5 x 2
     ID InteractionCount
  <dbl>            <int>
1     1                2
2     2                1
3     3                1
4     5                1
5    10                1

We can get a list of Interactor2s for each value of Interactor1 by going back to the original d2...

interactors1 <- d2 %>% 
                  group_by(Interactor1) %>% 
                  summarise(With1=list(unique(Interactor2))) %>% 
                  rename(ID=Interactor1)
interactors1

# A tibble: 2 x 2
     ID With1    
  <dbl> <list>   
1     1 <dbl [2]>
2     2 <dbl [1]>

If an ID can appear in both Interactor1 and Interactor2, things get a little more fiddly. (That doesn't happen in your example, but just in case...)

interactors2 <- d2 %>% group_by(Interactor2) %>% summarise(With2=list(unique(Interactor1))) %>% rename(ID=Interactor2)
interactors <- interactors1 %>% 
                 full_join(interactors2, by="ID") %>% 
                 unnest(cols=c(With1, With2)) %>% 
                 mutate(With=ifelse(is.na(With1), With2, With1)) %>% 
                 select(-With1, -With2)
interactors <- interactors %>% 
                 group_by(ID) %>% 
                 summarise(Interactors=list(unique(With)))
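The step above unnests two list columns at once, which can fail when an ID has lists of different lengths in With1 and With2 (this may be what produces the size error reported in the comments below). One possible alternative, sketched here rather than taken from the answer, is to stack the pairs in both directions first so that no joint unnest is needed. The names `both` and `interactors_alt` are hypothetical, and d1 uses doubles so its ID type matches d2:

```r
library(dplyr)

d1 <- tibble(ID = c(1, 2, 3))
d2 <- tibble(Interactor1 = c(1, 2, 1), Interactor2 = c(5, 3, 10))

# Stack every pair in both directions: (ID, With) and (With, ID)
both <- bind_rows(
  d2 %>% transmute(ID = Interactor1, With = Interactor2),
  d2 %>% transmute(ID = Interactor2, With = Interactor1)
)

# One row per ID: how many rows mention it, and a list of its partners
interactors_alt <- both %>%
  group_by(ID) %>%
  summarise(InteractionCount = n(), Interactors = list(With))

# Keep only the IDs present in d1
result <- d1 %>% left_join(interactors_alt, by = "ID")
result
```

Here the count is the number of rows of d2 that mention the ID. If the same pair can occur more than once and should be counted once, a distinct() call on `both` before grouping would be one way to handle that.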

Now you can bring everything together, and make sure you get the data only for the IDs you want:

interactors <- d1 %>% left_join(counts, by="ID") %>% left_join(interactors, by="ID")
interactors

# A tibble: 3 x 3
     ID InteractionCount Interactors
  <dbl>            <int> <list>     
1     1                2 <dbl [2]>  
2     2                1 <dbl [1]>  
3     3                1 <dbl [1]>

That's the data in the format you requested (one column with a list of interactors for each ID). Just to prove it:

interactors$Interactors[1]

[[1]]
[1]  5 10

But I think you might find it easier to do more with the answer if it's in tidy form:

interactors %>% unnest(cols=c(Interactors))

# A tibble: 4 x 3
     ID InteractionCount Interactors
  <dbl>            <int>       <dbl>
1     1                2           5
2     1                2          10
3     2                1           3
4     3                1           2

Note that your final output does not exactly match OP's shown output, but I suspect it may be what he wants. See my comment above on the Original Post. — DaveTurek, May 19 '20 at 22:09
Thank you for this, at the moment I get an error at: ```interactors <- interactors1 %>% + full_join(interactors2, by="ID") %>% + unnest(cols=c(With1, With2)) %>% + mutate(With=ifelse(is.na(With1), With2, With1)) %>% + select(-With1, -With2) Error: No common size for `With1`, size 68, and `With2`, size 24.``` Do you know why this might be? I'll keep looking to solve this in the meantime as everything else works great and this should all fully solve my problem when it runs. — DN1, May 20 '20 at 10:45
I can confirm my solution works for me. So I can think of three possible reasons for your problem: 1: You're using different data with a feature I haven't allowed for. 2: You're using a different version of tidyverse. 3: You've run into this bug (https://stackoverflow.com/questions/56811233/unnesting-a-data-frame-containing-lists). Try using `unchop()` rather than `unnest()` as suggested. I'll post an Rmd showing the output from my solution if I can figure out how to. — Limey, May 20 '20 at 11:51
