2

I would like to merge two data frames where one has more variables (columns) whereas the other has more observations (rows). A simplified example of how they are set up are below:

Dataframe 1:

ID      Date         Indicator
12345   01/01/2008   1
54321   12/01/2008   1

Dataframe 2:

ID      Date         
12345   01/01/2008   
12345   01/31/2008
12345   02/28/2009
24681   01/01/2008
54321   12/01/2008
54321   12/20/2008

What I would like to do is only keep the rows in which the ID's are an exact match. For example, I would like the following output:

New dataframe:

ID      Date         Indicator     
12345   01/01/2008   1
12345   01/31/2008   NA
12345   02/28/2009   NA
54321   12/01/2008   1
54321   12/20/2008   NA

I have tried

new <- merge(df1, df2, all=TRUE)

but this results in the merging of ALL rows, whereas I only want the rows from df2 with ID that are seen in df1.

Any help is appreciated!

sh2
  • 45
  • 9

7 Answers7

2

You can try with a dplyr solution:

library(dplyr)
# a right join when you filter Dataframe2 by ID in Dataframe1
Dataframe1 %>% right_join(Dataframe2[Dataframe2$ID %in% Dataframe1$ID,])  

Joining, by = c("ID", "Date")
     ID       Date Indicator
1 12345 01/01/2008         1
2 12345 01/31/2008        NA
3 12345 02/28/2009        NA
4 54321 12/01/2008         1
5 54321 12/20/2008        NA

# clearly you can put it in a data.frame
Dataframe3 <- Dataframe1 %>% right_join(Dataframe2[Dataframe2$ID %in% Dataframe1$ID,], by = 'ID') %>% 
             data.frame()

So you'll not have ID 24681 and you'll have NA in Indicator where it seems it's necessary, i.e. when Date does not want it.


Your data:

Dataframe1 <- read.table(text="ID      Date         Indicator
12345   01/01/2008   1
54321   12/01/2008   1", stringsAsFactors = F, header = T)

Dataframe2 <- read.table(text="ID      Date         
12345   01/01/2008   
12345   01/31/2008
12345   02/28/2009
24681   01/01/2008
54321   12/01/2008
54321   12/20/2008",stringsAsFactors = F, header = T)
s__
  • 9,270
  • 3
  • 27
  • 45
  • I've tried it, but is going to give this `ID Date Indicator 1 12345 01/01/2008 1 2 54321 12/01/2008 1`, and it seems not what OP needs. – s__ Oct 03 '18 at 20:43
  • Hi @sh2 does this help you to solve the issue or do you need other help? – s__ Oct 06 '18 at 09:20
1

Just try:

library(dplyr)
df2 %>%
  left_join(df1, by = c("ID", "Date")) %>% # or full_join(df1, by = c("ID", "Date"))
  filter(ID %in% df1$ID) 

Or based on what you started:

merge(df1, df2, all = TRUE) %>% filter(ID %in% df1$ID)
nghauran
  • 6,648
  • 2
  • 20
  • 29
1

You can try the function join() from the ply library. You also require an extra step to get the exact output you need.

library(plyr)

df1

     ID       Date Indicator
1 12345 2020-01-01         1
2 54321 2020-12-01         1

 df2

     ID       Date
1 12345 2020-01-01
2 12345 2020-01-31
3 12345 2020-02-28
4 24681 2020-01-01
5 54321 2020-12-01
6 54321 2020-12-20

# that extra step
df3 <- df2[df2$ID %in% df1$ID,]
df3
     ID       Date
1 12345 2020-01-01
2 12345 2020-01-31
3 12345 2020-02-28
5 54321 2020-12-01
6 54321 2020-12-20

join(df3, df1, by = c("ID", "Date"))
     ID       Date Indicator
1 12345 2020-01-01         1
2 12345 2020-01-31        NA
3 12345 2020-02-28        NA
4 54321 2020-12-01         1
5 54321 2020-12-20        NA
dcruz
  • 11
  • 1
1

If your data size is not too large, you can add a line that filters your results by df1$id.

new <- new[new$id %in% unique(df1$id),]
scs
  • 567
  • 6
  • 22
0

Edited based on s_t's comment:

left_join(df2, df1, by=c("ID", "Date")) %>% filter(ID %in% df1$ID)
MR_MPI-BGC
  • 265
  • 3
  • 11
  • 1
    This seems not give the desired result, as it's `ID Date Indicator 1 12345 01/01/2008 1 2 54321 12/01/2008 1`. – s__ Oct 03 '18 at 20:45
  • 1
    Yes, you are right! I misread the task `left_join(df2, df1, by=c("ID", "Date")) %>% filter(ID %in% df1$ID)` is correct. Which is quite similar to what you propose. – MR_MPI-BGC Oct 04 '18 at 08:10
0

Consider merge with subset:

df3 <- subset(merge(df1, df2, by=c("ID", "Date"), all=TRUE), ID %in% df1$ID)

df3
#      ID       Date Indicator
# 1 12345 01/01/2008         1
# 2 12345 01/31/2008        NA
# 3 12345 02/28/2009        NA
# 5 54321 12/01/2008         1
# 6 54321 12/20/2008        NA

To reset row.names, wrap the data.frame() constructor and specify row names argument:

df3 <- data.frame(subset(merge(df1, df2, by=c("ID", "Date"), all=TRUE),
                         ID %in% df1$ID),
                  row.names = NULL)

df3
#      ID       Date Indicator
# 1 12345 01/01/2008         1
# 2 12345 01/31/2008        NA
# 3 12345 02/28/2009        NA
# 4 54321 12/01/2008         1
# 5 54321 12/20/2008        NA
Parfait
  • 104,375
  • 17
  • 94
  • 125
0

a join is what you are looking for. If the table which you plan to keep as reference is on the left then its a left join. A sample code

    df1<-data.frame(ID=c(12345,54321) ,Date  =c('01/01/2008',' 12/01/2008 ')   ,    
     Indicator=c(1,1))

     df2<-data.frame(ID=c(12345,12345,5341) ,Date  =c('01/01/2008',' 12/01/2008 
      ','12/1/2008') )

    merge(df1,df2,by.x = 'ID',by.y='ID')

      ID     Date.x       Indicator       Date.y
      12345 01/01/2008         1    01/01/2008
      12345 01/01/2008         1    12/01/2008 

So only the rows from df1 that are present in df 2 are part of the output

rahul
  • 561
  • 1
  • 5
  • 13