1

Introduction to the problem

I have a dataset that is composed of two datasets. One contains baseline data (t1), the other contains follow-up data(t2). Not all cases in the baseline data are present in the follow-up data and not all cases in the follow-up data are present in the baseline data.

Aim

I would like to select only those cases that are present in both datasets. For this purpose I combined both data sets.

Example of data

      ID     Tn    Score 1    ......   Score n
1      1     t1      1        ......      n
2      1     t2      1        ......      n
3      2     t1      1        ......      n
4      3     t2      2        ......      n
5      4     t1      3        ......      n
6      4     t2      1        ......      n
7      5     t1      2        ......      n
8      6     t2      1        ......      n
9      7     t1      5        ......      n
10     7     t2      4        ......      n

In my example (above) I thus would like cases 1, 4, and 7 for subsequent analyses.

What I have tried

I tried working with unique() and duplicated(), but so far I have not found a working solution.

Community
  • 1
  • 1
L. Bakker
  • 147
  • 1
  • 13
  • Keep `t1` and `t2` as separate dataframes and use inner join. – zx8754 Oct 21 '15 at 09:19
  • @zx8754 It can be solved using `merge`, but that is not needed for the input combined dataset. – akrun Oct 21 '15 at 09:38
  • @akrun in the *Aim* section OP says, he needs inner-join for this purpose I combined the data. He shouldn't have combined, then it is 100% dupe of above merge post. – zx8754 Oct 21 '15 at 09:45
  • @zx8754 In that case, it will be a dupe. – akrun Oct 21 '15 at 09:53

3 Answers3

2

You can also do using base R...

rbind(merge(merge(subset(subset(df1,Tn == 't1'),select=c(ID)),subset(subset(df1,Tn == 't2'),select=c(ID)),by="ID"),subset(subset(df1,Tn == 't1'),by='ID')),merge(merge(subset(subset(df1,Tn == 't1'),select=c(ID)),subset(subset(df1,Tn == 't2'),select=c(ID)),by="ID"),subset(subset(df1,Tn == 't2'),by='ID')))

Another more elegant way to do it is...

subset(df1,(ID %in% subset(df1,select=c(ID), Tn == 't1')$ID) & (ID %in% subset(df1,select=c(ID), Tn == 't2')$ID))
Gaurav
  • 1,597
  • 2
  • 14
  • 31
1

We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df1), grouped by 'ID', if the unique counts of 'Tn' is equal to 2, we Subset the Data.Table (.SD).

library(data.table)#v1.9.6+
setDT(df1)[, if(uniqueN(Tn)==2) .SD, ID]
#   ID Tn Score1 Score2
#1:  1 t1      1      1
#2:  1 t2      1      3
#3:  4 t1      3      2
#4:  4 t2      1      1
#5:  7 t1      5      1
#6:  7 t2      4      5

Or using base R

 df1[df1$ID %in% names(which(rowSums(table(df1[1:2]))==2)) ,]

data

df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L, 5L, 6L, 7L, 7L), 
Tn = c("t1", "t2", "t1", "t2", "t1", "t2", "t1", "t2", "t1", 
"t2"), Score1 = c(1L, 1L, 1L, 2L, 3L, 1L, 2L, 1L, 5L, 4L), 
Score2 = c(1L, 3L, 2L, 4L, 2L, 1L, 1L, 4L, 1L, 5L)), .Names = c("ID", 
"Tn", "Score1", "Score2"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
akrun
  • 874,273
  • 37
  • 540
  • 662
1
df1[which(df1$ID  %in% df1[duplicated(df1$ID), ]$ID), ]

#ID Tn Score1 Score2
#1   1 t1      1      1
#2   1 t2      1      3
#5   4 t1      3      2
#6   4 t2      1      1
#9   7 t1      5      1
#10  7 t2      4      5
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213