0

I have two dataframes df1 and df2. There's a 1:n-relationship between these two dataframes with df2 having multiple entries for each entry in df1.

My goal is to merge these two dataframes so that all the rows in df1 are repeated and that all the columns in df2 are imported. The foreign key (FK) in df2 refers to the primary key (PK) in df1

# example data:
df1 <- data.frame(PK = c(1,2,3,4,5),
                  varA = sample(5, replace=T),
                  varB = sample(5, replace=T))

df2 <- data.frame(FK = c(1,1,2,2,3,3,4,4,4),
                  varC = seq(9),
                  varD = seq(9))

The dataframe should be like:

PK | FK | varA | varB | varC | varD
1    1    ...    ...    1      1
1    1    ...    ...    2      2
2    2    ...    ...    3      3
2    2    ...    ...    4      4
3    3    ...    ...    5      5
3    3    ...    ...    6      6
4    4    ...    ...    7      7
4    4    ...    ...    8      8
4    4    ...    ...    9      9
5    NA   ...    ...    NA     NA

All the entries in df1 should be there, even if there is no corresponding entry in df2 (varC and varD should be NA then)

D. Studer
  • 1,711
  • 1
  • 16
  • 35
  • With `dplyr` package : `df1 %>% left_join(df2, by = c("PK", "FK"))` maybe ? – Mostafa90 Mar 22 '19 at 15:31
  • do you need `merge(df1, df2)` ? – Ronak Shah Mar 22 '19 at 15:32
  • Also, your desired output doesn't match. `varC` and `varD` are a sequence from 1:9 in your example data. Also, when you use `sample`, please include `set.seed` so we can reproduce your data – divibisan Mar 22 '19 at 15:34
  • @DimitriPetrenko : Error: `by` can't contain join column `FK` which is missing from LHS – D. Studer Mar 22 '19 at 15:41
  • @divibisan you are right, I corrected the desired output-dataframe above. There is no need to include `set.seed` as the values do not matter here. – D. Studer Mar 22 '19 at 15:44
  • @D.Studer You're right that it doesn't really matter here, but it's a good practice. It's nice to be able to show that the results we get with our answers match your desired answers exactly and makes it much easier to troubleshoot issues with them. – divibisan Mar 22 '19 at 15:45
  • 1
    sorry, `df1 %>% left_join(df2, by = c("PK" = "FK"))` or `left_join(df1, df2, by = c("PK" = "FK"))` – Mostafa90 Mar 22 '19 at 15:49

1 Answers1

1

You can do this with the native R function merge() https://www.statmethods.net/management/merging.html

or use dplyr, which I personally prefer due to the more intuitive sql-like syntax: https://dplyr.tidyverse.org/reference/join.html?q=inner%20_%20j#join-types

Since you are referring to "PK" and "FK" you may prefer the second option, but will need to install and load the dplyr package first using commands

install.packages("dplyr")
library(dplyr)
Simon Stolz
  • 207
  • 2
  • 7
  • `merge()` is sort of SQL `UNION` , `dplyr::left_join()` is similar to SQL `LEFT JOIN`. They are for not the same case of use. – phili_b Mar 22 '19 at 15:37