1

I'm trying to figure out how to merge/join two data frames in such a way that if a certain condition is met, R combines two rows from the two data frames into one row, but if the condition is not met, R adds a new row with NAs in the columns that are not present in the original data frame. I'm unclear if this is a simpler join than I think it is, but I've been unable to figure out how to do this, even after reading through some stack overflow results (e.g. or e.g.).

Below are two example data frames:

df1 <- data.frame(Name_df1 = c("Alan", "Steve", "Melanie", "Steve", "Melanie"),
                  Date_df1 = c("02/18/2008", "02/18/2008", "03/14/2009", "04/19/2009", "03/16/2009"),
                  Job_df1 = c("Cook", "Security", "Greeter", "Security", "Greeter"),
                  Hours_df1 = c(8.5, 7.0, 6.0, 7.0, 6.0))

df2 <- data.frame(Name_df2 = c("Steve", "Alan", "Melanie", "Melanie", "Steve", "Carter"),
                  Date_df2 = c("02/18/2008", "02/18/2008", "03/14/2009", "03/17/2009","04/25/2009", "08/15/2011"),
                  Job_df2 = c("Police", "Chef", "Greeter", "Greeter", "Security", "Doorman"),
                  Hours_df2 = c(8.5, 7.0, 6.0, 7.0, 7.0, 6.5),
                  Wage_df2 = c(80, 77, 127.5, 90, 145, 100))

df1 and df2 both describe dates on which certain individuals worked, their job titles, names, and hours worked on the specified date, and only df2 also contains wage information. What I want to do is join df1 and df2 into a df3 with the following columns:

  • "Name_df1"
  • "Name_df2"
  • "Date_df1"
  • "Date_df2"
  • "Job_df1"
  • "Job_df2"
  • "Hours_df1"
  • "Hours_df2"
  • "Wage_df2"

Specifically, I want to join df1 and df2 such that for every row in df1 or df2, if there is an exact match for both the value in the "Name_df1"/"Name_df2" column and the value in the "Date_df1"/"Date_df2" column in the corresponding column in the opposite data frame (i.e. in df1, if the value in the "Date_df1" column and "Name_df1" columns each have an exact match in the "Date_df2" column and "Name_df2" column, respectively, of df2):

  • A row is created in df3 that has the "Name_df1", "Name_df2", "Date_df1", "Date_df2", "Job_df1", "Job_df2", "Hours_df1", "Hours_df2", and "Wage" columns filled in with the relevant values from df1 and df2. The values in the "Job_df1", "Job_df2", "Hours_df1", and "Hours_df2", may or may not be the same.

However, if there is not an exact match for both the value in the "Name_df1"/"Name_df2" column and the value in the "Date_df1"/"Date_df2" column, then df3 should get a row that includes only the information from df1 in "Name_df1", "Date_df1", "Job_df1", and "Hours_df1", and NAs in "Name_df2", "Date_df2", "Job_df2", "Hours_df2", and "Wage_df2". Or, if the original row is from df2, then df3 should get a row that includes only the information from df2 in "Name_df2", "Date_df2", "Job_df2", "Hours_df2", and "Wage_df2", and NAs in "Name_df1", "Date_df1", "Job_df1", and "Hours_df1".

I know that's a mouthful, but based on the df1 and df2 specified above, this is what df3 would look like:

df3 <- data.frame(Name_df1 = c("Alan", "Steve", "Melanie", "Steve", "Melanie", NA, NA, NA),
                  Name_df2 = c("Alan", "Steve", "Melanie", NA, NA, "Melanie", "Steve", "Carter"),
                  Date_df1 = c("02/18/2008", "02/18/2008", "03/14/2009", "04/19/2009", "03/16/2009", NA, NA, NA),
                  Date_df2 = c("02/18/2008", "02/18/2008", "03/14/2009", NA, NA, "03/17/2009", "04/25/2009", "08/15/2011"),
                  Job_df1 = c("Cook", "Security", "Greeter", "Security", "Greeter", NA, NA, NA),
                  Job_df2 = c("Chef", "Police", "Greeter", NA, NA, "Greeter", "Security", "Doorman"),
                  Hours_df1 = c(7.0, 7.0, 6.0, 7.0, 6.0, NA, NA, NA),
                  Hours_df2 = c(7.0, 8.5, 6.0, NA, NA, 7.0, 7.0, 6.5),
                  Wage_df2 = c(77.0, 80, 127.5, NA, NA, 90.0, 145.0, 100))

Could anyone help me with this?

ben_p_4370
  • 53
  • 5
  • 'How do I conditionally combine some rows (but not others)' You don't. A cross join gives all combinations of a row from each table & then you keep the ones that satisfy a condition. Read about basics of joining. – philipxy Feb 19 '21 at 03:40
  • I'm aware of cross joins, just unsure of what filtering to these specific conditions would look like – ben_p_4370 Feb 19 '21 at 04:47

3 Answers3

1

It seems you are looking for full_join but to get the exact desired output (df3) you need to perform some manipulation.

library(dplyr)

full_join(df1, df2, by = c('Name_df1' = 'Name_df2', 'Date_df1' = 'Date_df2')) %>%
  mutate(Name_df2 = ifelse(is.na(Job_df2), NA, Name_df1),
         Name_df1 = ifelse(is.na(Job_df1), NA, Name_df1), 
         Date_df2 = ifelse(is.na(Job_df2), NA, Date_df1),
         Date_df1 = ifelse(is.na(Job_df1), NA, Date_df1)) %>%
  select(starts_with('Name'), starts_with('Date'), starts_with('Job'), everything())

#  Name_df1 Name_df2   Date_df1   Date_df2  Job_df1  Job_df2 Hours_df1 Hours_df2 Wage_df2
#1     Alan     Alan 02/18/2008 02/18/2008     Cook     Chef       8.5       7.0     77.0
#2    Steve    Steve 02/18/2008 02/18/2008 Security   Police       7.0       8.5     80.0
#3  Melanie  Melanie 03/14/2009 03/14/2009  Greeter  Greeter       6.0       6.0    127.5
#4    Steve     <NA> 04/19/2009       <NA> Security     <NA>       7.0        NA       NA
#5  Melanie     <NA> 03/16/2009       <NA>  Greeter     <NA>       6.0        NA       NA
#6     <NA>  Melanie       <NA> 03/17/2009     <NA>  Greeter        NA       7.0     90.0
#7     <NA>    Steve       <NA> 04/25/2009     <NA> Security        NA       7.0    145.0
#8     <NA>   Carter       <NA> 08/15/2011     <NA>  Doorman        NA       6.5    100.0
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thank you, this is very close to what I'm looking for. However, when a name and date pair appear in both df1 and df2, I want the Name_df1 and Name_df2 and Date_df1 and Date_df2 columns to all be populated in df3. So in the result of your code, I would want Name_df2 in row 1 to say "Alan" and Date_df2 in row 1 to say "2/18/2008." The same would apply to rows 2 and 3 (Steve and Melanie on dates 2/18/2008 and 3/14/2009, respectively), because those pairs match in the date and name columns in df1 and df2. Do you know how I could achieve that? – ben_p_4370 Feb 19 '21 at 04:50
  • 1
    Yes, you are right. That requires small change in the answer. See my updated answer @ben_p_4370 – Ronak Shah Feb 19 '21 at 05:01
  • This is perfect, thank you. Do you know how I can stop ifelse from turning my Dates into numeric (e.g.17266)? – ben_p_4370 Feb 19 '21 at 05:17
  • 1
    Instead of `ifelse` use `if_else`. – Ronak Shah Feb 19 '21 at 05:18
  • Perfect. Thanks for your help – ben_p_4370 Feb 19 '21 at 05:20
1

Maybe this could be a way to join those two data frames:

library(dplyr) df3 <- df1 %>% full_join(df2, by="Name", suffix= c(".df1", ".df2"))

Result:

Table

Julian
  • 11
  • 2
0

I think you want a full join with keep = TRUE :

library(dplyr)
full_join(df1, df2, by = c(Name_df1 = "Name_df2", Date_df1 = "Date_df2"), keep = TRUE)
#>   Name_df1   Date_df1  Job_df1 Hours_df1 Name_df2   Date_df2  Job_df2 Hours_df2
#> 1     Alan 02/18/2008     Cook       8.5     Alan 02/18/2008     Chef       7.0
#> 2    Steve 02/18/2008 Security       7.0    Steve 02/18/2008   Police       8.5
#> 3  Melanie 03/14/2009  Greeter       6.0  Melanie 03/14/2009  Greeter       6.0
#> 4    Steve 04/19/2009 Security       7.0     <NA>       <NA>     <NA>        NA
#> 5  Melanie 03/16/2009  Greeter       6.0     <NA>       <NA>     <NA>        NA
#> 6     <NA>       <NA>     <NA>        NA  Melanie 03/17/2009  Greeter       7.0
#> 7     <NA>       <NA>     <NA>        NA    Steve 04/25/2009 Security       7.0
#> 8     <NA>       <NA>     <NA>        NA   Carter 08/15/2011  Doorman       6.5
#>   Wage_df2
#> 1     77.0
#> 2     80.0
#> 3    127.5
#> 4       NA
#> 5       NA
#> 6     90.0
#> 7    145.0
#> 8    100.0

Created on 2022-01-17 by the reprex package (v2.0.1)

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167