2

I have two different data frames composed of strings (several thousands of rows). They each have a different number of entries. I am trying to find a code that will allow me to associate the content of the column "characteristic" in df2 into df1 (see example below) :

Df1 :

ID Information
11AA info1
22BB info2
33CC info3
44DD info4
11AA info1
22BB info2

Df2:

ID Characteristic
11AA char1
22BB char2
33CC char3
44DD char4

The intended result would be :

Df3:

ID Information Characteristic
11AA info1 char1
22BB info2 char2
33CC info3 char3
44DD info4 char4
11AA info1 char1
22BB info2 char2

I tried using inner_join(df1,df2, by="ID") but the resulting dataframe (df3) often has more rows than the original (df1) and I need that the resulting dataframe (df3) keeps the same structure than the original one (df1), I just need to add the extra column.

Is there a built in function for this type of operation?

Juan
  • 21
  • 3
  • 1
    The presence of more rows suggests that you have duplicate `ID` in either one of your input frames. That does not occur with your sample data, so you won't get it. If you want to remove the duplicates, you need to provide (to us and to R) some logic on how to do that, there is not general way to always do that. – r2evans Aug 20 '21 at 17:55

2 Answers2

3
merge(Df1, Df2, by = "ID")
#     ID Information Characteristic
# 1 11AA       info1          char1
# 2 22BB       info2          char2
# 3 22BB       info2          char2
# 4 33CC       info3          char3
# 5 44DD       info4          char4

If you need non-matches too (since your data has 11AB in one and not the other), then use all=TRUE (or all.x= or all.y, your choice).

merge(Df1, Df2, by = "ID", all=TRUE)
#     ID Information Characteristic
# 1 11AA       info1          char1
# 2 11AB       info1           <NA>
# 3 22BB       info2          char2
# 4 22BB       info2          char2
# 5 33CC       info3          char3
# 6 44DD       info4          char4

Data

Df1 <- structure(list(ID = c("11AA", "22BB", "33CC", "44DD", "11AB", "22BB"), Information = c("info1", "info2", "info3", "info4", "info1", "info2")), class = "data.frame", row.names = c(NA, -6L))
Df2 <- structure(list(ID = c("11AA", "22BB", "33CC", "44DD"), Characteristic = c("char1", "char2", "char3", "char4")), class = "data.frame", row.names = c(NA, -4L))
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    Actually, I was thinking it's a dupe of https://stackoverflow.com/q/1299871/3358272. I'm waiting for them to come back with anything that explains the "more rows" problem. – r2evans Aug 20 '21 at 18:11
  • Thank you so much for your answer. Actually the 11AB entry was a typo, I'm going to edit the example so it will be clear. I tried the merge() function but i still get more rows than the original dataframe. However it is entirely possible that there are duplicates in the original dataframe, which I should not remove. I also think I previously saw the post you're linking here. Is there a way I could compare my resulting dataframe with the original one to see which are the rows that are being added ? – Juan Aug 20 '21 at 18:29
  • Juan, your updated data and your `inner_join` code produces 6 rows, so there is no evidence there of your "extra rows". – r2evans Aug 28 '21 at 20:54
2

You are looking for an outer join.

library(dplyr)

df3<- left_join(df1, df2, by = c("ID"))
d3hero23
  • 380
  • 1
  • 12
  • An "outer join" would be `full_join`, not `left_join`. Some good references for merges/join (both in R and SQL) are: https://stackoverflow.com/q/1299871/3358272, https://stackoverflow.com/q/5706437/3358272 – r2evans Nov 23 '22 at 13:04