Merge data frames by exact match of column

Question

I would like to merge two data frames where one has more variables (columns) whereas the other has more observations (rows). A simplified example of how they are set up are below:

Dataframe 1:

ID      Date         Indicator
12345   01/01/2008   1
54321   12/01/2008   1

Dataframe 2:

ID      Date         
12345   01/01/2008   
12345   01/31/2008
12345   02/28/2009
24681   01/01/2008
54321   12/01/2008
54321   12/20/2008

What I would like to do is only keep the rows in which the ID's are an exact match. For example, I would like the following output:

New dataframe:

ID      Date         Indicator     
12345   01/01/2008   1
12345   01/31/2008   NA
12345   02/28/2009   NA
54321   12/01/2008   1
54321   12/20/2008   NA

I have tried

new <- merge(df1, df2, all=TRUE)

but this results in the merging of ALL rows, whereas I only want the rows from df2 with ID that are seen in df1.

Any help is appreciated!

`merge(df1, df2, by="ID", all = TRUE)` according to this https://stackoverflow.com/a/6709335/1719510 — foc, Oct 03 '18 at 20:33
@foc, OP seems not want ID 24681, and seems that the join is also on date. — s__, Oct 03 '18 at 20:39
@s_t thank you. I didn't run into the results. Trying to learn answering. — foc, Oct 03 '18 at 20:43

s__ · Answer 1 · 2018-10-03T20:50:39.027

You can try with a dplyr solution:

library(dplyr)
# a right join when you filter Dataframe2 by ID in Dataframe1
Dataframe1 %>% right_join(Dataframe2[Dataframe2$ID %in% Dataframe1$ID,])  

Joining, by = c("ID", "Date")
     ID       Date Indicator
1 12345 01/01/2008         1
2 12345 01/31/2008        NA
3 12345 02/28/2009        NA
4 54321 12/01/2008         1
5 54321 12/20/2008        NA

# clearly you can put it in a data.frame
Dataframe3 <- Dataframe1 %>% right_join(Dataframe2[Dataframe2$ID %in% Dataframe1$ID,], by = 'ID') %>% 
             data.frame()

So you'll not have ID 24681 and you'll have NA in Indicator where it seems it's necessary, i.e. when Date does not want it.

Your data:

Dataframe1 <- read.table(text="ID      Date         Indicator
12345   01/01/2008   1
54321   12/01/2008   1", stringsAsFactors = F, header = T)

Dataframe2 <- read.table(text="ID      Date         
12345   01/01/2008   
12345   01/31/2008
12345   02/28/2009
24681   01/01/2008
54321   12/01/2008
54321   12/20/2008",stringsAsFactors = F, header = T)

I've tried it, but is going to give this `ID Date Indicator 1 12345 01/01/2008 1 2 54321 12/01/2008 1`, and it seems not what OP needs. — s__, Oct 03 '18 at 20:43
Hi @sh2 does this help you to solve the issue or do you need other help? — s__, Oct 06 '18 at 09:20

nghauran · Answer 2 · 2018-10-03T21:10:36.530

1

Just try:

library(dplyr)
df2 %>%
  left_join(df1, by = c("ID", "Date")) %>% # or full_join(df1, by = c("ID", "Date"))
  filter(ID %in% df1$ID)

Or based on what you started:

merge(df1, df2, all = TRUE) %>% filter(ID %in% df1$ID)

edited Oct 03 '18 at 21:10

answered Oct 03 '18 at 20:53

nghauran

6,648
2
20
29

score 1 · Answer 3 · answered Oct 03 '18 at 20:54

You can try the function join() from the ply library. You also require an extra step to get the exact output you need.

library(plyr)

df1

     ID       Date Indicator
1 12345 2020-01-01         1
2 54321 2020-12-01         1

 df2

     ID       Date
1 12345 2020-01-01
2 12345 2020-01-31
3 12345 2020-02-28
4 24681 2020-01-01
5 54321 2020-12-01
6 54321 2020-12-20

# that extra step
df3 <- df2[df2$ID %in% df1$ID,]
df3
     ID       Date
1 12345 2020-01-01
2 12345 2020-01-31
3 12345 2020-02-28
5 54321 2020-12-01
6 54321 2020-12-20

join(df3, df1, by = c("ID", "Date"))
     ID       Date Indicator
1 12345 2020-01-01         1
2 12345 2020-01-31        NA
3 12345 2020-02-28        NA
4 54321 2020-12-01         1
5 54321 2020-12-20        NA

score 1 · Answer 4 · answered Oct 03 '18 at 20:55

1

If your data size is not too large, you can add a line that filters your results by df1$id.

new <- new[new$id %in% unique(df1$id),]

answered Oct 03 '18 at 20:55

scs

567
6
22

MR_MPI-BGC · Answer 5 · 2018-10-04T08:19:15.423

0

Edited based on s_t's comment:

left_join(df2, df1, by=c("ID", "Date")) %>% filter(ID %in% df1$ID)

edited Oct 04 '18 at 08:19

answered Oct 03 '18 at 20:41

MR_MPI-BGC

265
3
11

1

This seems not give the desired result, as it's `ID Date Indicator 1 12345 01/01/2008 1 2 54321 12/01/2008 1`. – s__ Oct 03 '18 at 20:45
1

Yes, you are right! I misread the task `left_join(df2, df1, by=c("ID", "Date")) %>% filter(ID %in% df1$ID)` is correct. Which is quite similar to what you propose. – MR_MPI-BGC Oct 04 '18 at 08:10

score 0 · Answer 6 · answered Oct 03 '18 at 20:52

Consider merge with subset:

df3 <- subset(merge(df1, df2, by=c("ID", "Date"), all=TRUE), ID %in% df1$ID)

df3
#      ID       Date Indicator
# 1 12345 01/01/2008         1
# 2 12345 01/31/2008        NA
# 3 12345 02/28/2009        NA
# 5 54321 12/01/2008         1
# 6 54321 12/20/2008        NA

To reset row.names, wrap the data.frame() constructor and specify row names argument:

df3 <- data.frame(subset(merge(df1, df2, by=c("ID", "Date"), all=TRUE),
                         ID %in% df1$ID),
                  row.names = NULL)

df3
#      ID       Date Indicator
# 1 12345 01/01/2008         1
# 2 12345 01/31/2008        NA
# 3 12345 02/28/2009        NA
# 4 54321 12/01/2008         1
# 5 54321 12/20/2008        NA

score 0 · Answer 7 · answered Oct 04 '18 at 09:07

a join is what you are looking for. If the table which you plan to keep as reference is on the left then its a left join. A sample code

    df1<-data.frame(ID=c(12345,54321) ,Date  =c('01/01/2008',' 12/01/2008 ')   ,    
     Indicator=c(1,1))

     df2<-data.frame(ID=c(12345,12345,5341) ,Date  =c('01/01/2008',' 12/01/2008 
      ','12/1/2008') )

    merge(df1,df2,by.x = 'ID',by.y='ID')

      ID     Date.x       Indicator       Date.y
      12345 01/01/2008         1    01/01/2008
      12345 01/01/2008         1    12/01/2008

So only the rows from df1 that are present in df 2 are part of the output

Merge data frames by exact match of column

7 Answers7