How can I select only those cases that are present in two datasets?

Question

Introduction to the problem

I have a dataset that is composed of two datasets. One contains baseline data (t1), the other contains follow-up data(t2). Not all cases in the baseline data are present in the follow-up data and not all cases in the follow-up data are present in the baseline data.

Aim

I would like to select only those cases that are present in both datasets. For this purpose I combined both data sets.

Example of data

      ID     Tn    Score 1    ......   Score n
1      1     t1      1        ......      n
2      1     t2      1        ......      n
3      2     t1      1        ......      n
4      3     t2      2        ......      n
5      4     t1      3        ......      n
6      4     t2      1        ......      n
7      5     t1      2        ......      n
8      6     t2      1        ......      n
9      7     t1      5        ......      n
10     7     t2      4        ......      n

In my example (above) I thus would like cases 1, 4, and 7 for subsequent analyses.

What I have tried

I tried working with unique() and duplicated(), but so far I have not found a working solution.

Keep `t1` and `t2` as separate dataframes and use inner join. — zx8754, Oct 21 '15 at 09:19
@zx8754 It can be solved using `merge`, but that is not needed for the input combined dataset. — akrun, Oct 21 '15 at 09:38
@akrun in the *Aim* section OP says, he needs inner-join for this purpose I combined the data. He shouldn't have combined, then it is 100% dupe of above merge post. — zx8754, Oct 21 '15 at 09:45

Gaurav · Accepted Answer · 2015-10-21T09:50:11.703

2

You can also do using base R...

rbind(merge(merge(subset(subset(df1,Tn == 't1'),select=c(ID)),subset(subset(df1,Tn == 't2'),select=c(ID)),by="ID"),subset(subset(df1,Tn == 't1'),by='ID')),merge(merge(subset(subset(df1,Tn == 't1'),select=c(ID)),subset(subset(df1,Tn == 't2'),select=c(ID)),by="ID"),subset(subset(df1,Tn == 't2'),by='ID')))

Another more elegant way to do it is...

subset(df1,(ID %in% subset(df1,select=c(ID), Tn == 't1')$ID) & (ID %in% subset(df1,select=c(ID), Tn == 't2')$ID))

edited Oct 21 '15 at 09:50

answered Oct 21 '15 at 09:14

Gaurav

1,597
2
14
31

The subsetting alternative worked really well. Thank you! – L. Bakker Nov 05 '15 at 08:01

akrun · Answer 2 · 2015-10-21T09:44:23.910

We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df1), grouped by 'ID', if the unique counts of 'Tn' is equal to 2, we Subset the Data.Table (.SD).

library(data.table)#v1.9.6+
setDT(df1)[, if(uniqueN(Tn)==2) .SD, ID]
#   ID Tn Score1 Score2
#1:  1 t1      1      1
#2:  1 t2      1      3
#3:  4 t1      3      2
#4:  4 t2      1      1
#5:  7 t1      5      1
#6:  7 t2      4      5

Or using base R

 df1[df1$ID %in% names(which(rowSums(table(df1[1:2]))==2)) ,]

data

df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L, 5L, 6L, 7L, 7L), 
Tn = c("t1", "t2", "t1", "t2", "t1", "t2", "t1", "t2", "t1", 
"t2"), Score1 = c(1L, 1L, 1L, 2L, 3L, 1L, 2L, 1L, 5L, 4L), 
Score2 = c(1L, 3L, 2L, 4L, 2L, 1L, 1L, 4L, 1L, 5L)), .Names = c("ID", 
"Tn", "Score1", "Score2"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))

score 1 · Answer 3 · answered Oct 21 '15 at 09:30

1

df1[which(df1$ID  %in% df1[duplicated(df1$ID), ]$ID), ]

#ID Tn Score1 Score2
#1   1 t1      1      1
#2   1 t2      1      3
#5   4 t1      3      2
#6   4 t2      1      1
#9   7 t1      5      1
#10  7 t2      4      5

answered Oct 21 '15 at 09:30

Ronak Shah

377,200
20
156
213

I don't think this is correct based on the description in the OP's post though it gives the output. – akrun Oct 21 '15 at 09:39
@akrun I didn't get you. Will this fail in any specific condition? – Ronak Shah Oct 21 '15 at 09:42

How can I select only those cases that are present in two datasets?

Introduction to the problem

Aim

Example of data

What I have tried

3 Answers3

data