How to merge two dataframes with different length and one is a subset of the other

Question

I have two data frames, df1 and df2. df1 contains a member_ID and a result_id df2 contains a member_ID and a result_id

However, in df1 I have 1000 rows, in df2 I have 1200 rows. The df1 contains every row in df2 plus 200 other rows.

I need df3, containing member_ID and result_ID, which contains all 1200 rows, and if there is a result_id in df1, then the value of df1 should be used, while if there is no result_id in df1 (for the remaining 200 rows), the result_ID in df2 should be used.

When merged, the df will consist of the 1000 matching rows only.

Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. — Sotos, Nov 23 '17 at 09:28

Barbara · Accepted Answer · 2017-11-23T10:26:15.067

0

You can use the dplyr package and do a anti_join to get the additional results contained in df2and then do a rbind to bind all in one data frame:

library(dplyr)
df_tmp <- anti_join(df1 ,df2, by=member_ID)
df3 <-rbind(df_tmp, df1)

edited Nov 23 '17 at 10:26

answered Nov 23 '17 at 09:28

Barbara

1,118
2
11
34

Thanks, it worked. I have an additional question, same data frames, df1 contains 1000 rows, df2 contains 1200 rows. Let us say that df1 is not exactly a subset of df2 (it turned out to be the case so I have more than 1200 rows now. I need to create df3 with 1200 rows, if the member_ID of df2 = member_ID of df1, then use result_ID of df1, if there is no match, then use the result_ID of df2, but I need members from df2 only. – Tilsight Nov 23 '17 at 09:56
@Tilsight I think it should work anyway because you are taking all `df1` and the differences between `df1` and `df2`. To make sure that there are no duplicates you could add an additional step at the end and do: `df3 <- unique(df3)` – Barbara Nov 23 '17 at 09:59
Thanks, Barbara, but unfortunately there was no duplication. I am checking the anti join now. – Tilsight Nov 23 '17 at 10:05
@Tilsight I have edited the answer now, try with `df_tmp <- anti_join(df1 ,df2, by=member_ID)` – Barbara Nov 23 '17 at 10:26
I understand the logic now. We take the "difference" by member ID to get those who do not appear in df1 and add it later to df1. It is getting better but still not perfect. I will get back to you what was wrong. – Tilsight Nov 23 '17 at 10:49
Can I just replace result_ID in df2 with result_ID in df1 for maching member_ID somehow? Like if member_ID of df1 = member_ID of df2 THEN result_ID of df 1 => result_ID of df2, ELSE do nothing. – Tilsight Nov 23 '17 at 10:53
@Tilsight what is wrong with the results? – Barbara Nov 23 '17 at 10:57
It still returns more rows than 1200. I looked up the "duplication" with outer, left, right, etc merge, tried all but they also returned different results. And I can not ask more questions because the clever boys marked as a duplicate while it is clearly not.... thanks ... – Tilsight Nov 23 '17 at 12:33
@Tilsight could you post a `dput()` of your data frames? Another thing you could try is this `df_tmp <- anti_join(df1 ,df2, by=member_ID) df_tmp2 <- inner_join(df1 ,df2, by=member_ID) df3 <- rbind(df_tmp, df_tmp2)` – Barbara Nov 23 '17 at 12:39

How to merge two dataframes with different length and one is a subset of the other

1 Answers1