1

I am currently trying to join several dataframes with a primary key (i.e. ID column) in R studio

library(dplyr)
library(tidyr)

#DATAFRAME (1)
ID <- c(1, 2, 3, 4, 5, 6, 7, 8)
V1 <- c(9, 3, 6, 7, 1, 5, 8, 3)
V2 <- c(7, 5, 9, 2, NA, 3, 7, NA)
DF1 <- data.frame(ID, V1, V2)

#DATAFRAME (2)
ID <- c(1, 2, 2, 3, 3, 4, 5)
V3 <- c(1, 2, 'surcharge', 2, 'surcharge', 1, 1)
V4 <- c(400, 450, 4, 450, 4, 400, 400)
DF2 <- data.frame(ID, V3, V4)

#DATAFRAME (3)
ID <- c(1, 2, 2, 3, 3, 9, 10)
V5 <- c('credit', 'credit', 'credit', 'credit', 'credit', 'cash', 'cash')
DF3 <- data.frame(ID, V5)

When I perform a simple left_join of DF1 and DF2:

JoinDF <- left_join(DF1, DF2, by = "ID")

All hunky dory so far

   ID V1 V2        V3  V4
1   1  9  7         1 400
2   2  3  5         2 450
3   2  3  5 surcharge   4
4   3  6  9         2 450
5   3  6  9 surcharge   4
6   4  7  2         1 400
7   5  1 NA         1 400
8   6  5  3      <NA>  NA
9   7  8  7      <NA>  NA
10  8  3 NA      <NA>  NA

My problems arise when trying to join a third DF (DF3) to the new DF (JoinDF)

JoinDF2 <- left_join(x = JoinDF, y = DF3, by = "ID")


   ID V1 V2        V3 V4.x   V4.y
1   1  9  7         1  400 credit
2   2  3  5         2  450 credit
3   2  3  5         2  450 credit
4   2  3  5 surcharge    4 credit
5   2  3  5 surcharge    4 credit
6   3  6  9         2  450 credit
7   3  6  9         2  450 credit
8   3  6  9 surcharge    4 credit
9   3  6  9 surcharge    4 credit
10  4  7  2         1  400   <NA>
11  5  1 NA         1  400   <NA>
12  6  5  3      <NA>   NA   <NA>
13  7  8  7      <NA>   NA   <NA>
14  8  3 NA      <NA>   NA   <NA>

Which creates two problems for me:

  1. 'Duplicate' entries for IDs 2 and 3

  2. Removals of IDs 9 and 10, for which there is no match in DF1 or DF2

Re. problem 1, I realise R is doing right by me creating all possible combinations of the data when doing a left_join.

This makes me suspect I need to use reshape or cast, to remove those duplicate IDs from the DF2 and DF3 before joining them. Specifically, V3 (the pesky surcharge). However, I am fairly new to R and am somewhat at a loss as to how to code this

If anyone has any suggestions, I would be greatly appreciative

perkot
  • 121
  • 1
  • 10
  • 1
    Aye. You do indeed need to make some decisions. Which "2" do you want from `DF2`? The one with `V3` == 2 or `surcharge`? What are the valid/invalid values for the columns that would be rules to filter out various rows? You need that defined before you can do anything else worth doing with this data. – hrbrmstr Jan 17 '17 at 05:33
  • 1
    I would like to keep both. Concerning V3 and V4, as it applies to my real-life set, V3_2 represents item cost, V3_surcharge represents an additional credit card fee that will only apply in some instances. Thus, 'duplicates' always represent a credit card fee. I could potentially add the surcharge to the item cost. This would be one way to maintain a single ID. It would be nice to have an additional column indicating where a surcharge is present however – perkot Jan 17 '17 at 05:47

0 Answers0