I am currently trying to join several dataframes with a primary key (i.e. ID column) in R studio
library(dplyr)
library(tidyr)
#DATAFRAME (1)
ID <- c(1, 2, 3, 4, 5, 6, 7, 8)
V1 <- c(9, 3, 6, 7, 1, 5, 8, 3)
V2 <- c(7, 5, 9, 2, NA, 3, 7, NA)
DF1 <- data.frame(ID, V1, V2)
#DATAFRAME (2)
ID <- c(1, 2, 2, 3, 3, 4, 5)
V3 <- c(1, 2, 'surcharge', 2, 'surcharge', 1, 1)
V4 <- c(400, 450, 4, 450, 4, 400, 400)
DF2 <- data.frame(ID, V3, V4)
#DATAFRAME (3)
ID <- c(1, 2, 2, 3, 3, 9, 10)
V5 <- c('credit', 'credit', 'credit', 'credit', 'credit', 'cash', 'cash')
DF3 <- data.frame(ID, V5)
When I perform a simple left_join of DF1 and DF2:
JoinDF <- left_join(DF1, DF2, by = "ID")
All hunky dory so far
ID V1 V2 V3 V4
1 1 9 7 1 400
2 2 3 5 2 450
3 2 3 5 surcharge 4
4 3 6 9 2 450
5 3 6 9 surcharge 4
6 4 7 2 1 400
7 5 1 NA 1 400
8 6 5 3 <NA> NA
9 7 8 7 <NA> NA
10 8 3 NA <NA> NA
My problems arise when trying to join a third DF (DF3) to the new DF (JoinDF)
JoinDF2 <- left_join(x = JoinDF, y = DF3, by = "ID")
ID V1 V2 V3 V4.x V4.y
1 1 9 7 1 400 credit
2 2 3 5 2 450 credit
3 2 3 5 2 450 credit
4 2 3 5 surcharge 4 credit
5 2 3 5 surcharge 4 credit
6 3 6 9 2 450 credit
7 3 6 9 2 450 credit
8 3 6 9 surcharge 4 credit
9 3 6 9 surcharge 4 credit
10 4 7 2 1 400 <NA>
11 5 1 NA 1 400 <NA>
12 6 5 3 <NA> NA <NA>
13 7 8 7 <NA> NA <NA>
14 8 3 NA <NA> NA <NA>
Which creates two problems for me:
'Duplicate' entries for IDs 2 and 3
Removals of IDs 9 and 10, for which there is no match in DF1 or DF2
Re. problem 1, I realise R is doing right by me creating all possible combinations of the data when doing a left_join.
This makes me suspect I need to use reshape or cast, to remove those duplicate IDs from the DF2 and DF3 before joining them. Specifically, V3 (the pesky surcharge). However, I am fairly new to R and am somewhat at a loss as to how to code this
If anyone has any suggestions, I would be greatly appreciative