Match factors in merged dataframes

Question

I have dataframes DF1 and DF2:

DF1 <- data.frame(color = factor(c("Blue", "Brown", "Blue", "Brown", "Blue", "Brown", "Blue", "Brown")),
                 location = factor(c("California", "Nevada", "Nevada", "California", "Arizona", "Arizona", "California", "Nevada")),
                 respondent = factor(c("R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8")))
DF2 <- data.frame(shape = factor(c("Square", "Square", "Round", "Square", "Round", "Round", "Square", "Round", "Square", "Square")),
                 location = factor(c("California", "Nevada", "Arizona", "California", "California", "Arizona", "California", "Nevada", "Arizona", "California")),
                 respondent = factor(c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10")))

I want to merge them and create dataframe DF3, and would like to "match" the values of the factor location from the two original dataframes.

The following results in a dataframe in a format I want, but with two different location factors (location.y and location.x):

DF3 <- merge(data.frame(DF1, row.names=NULL), data.frame(DF2, row.names=NULL), 
      by = 0, all = TRUE)[-1]

I would like to avoid that, and have only one location factor with the locations from both dataframes matched, but merging the dataframes by location results in duplication of values in DF3:

DF3 <- merge(data.frame(DF1, row.names=NULL), data.frame(DF2, row.names=NULL), 
          by = "location", all = TRUE)[-1]

Is there a better way to merge the dataframes?

score 1 · Accepted Answer · edited Oct 28 '21 at 17:03

You can try this (in by you must specify the keys):

#Merge
DF3 <- merge(DF1,DF2,by='location',all=T,suffixes = c('df1','df2'))

     location color respondentdf1  shape respondentdf2
1     Arizona  Blue            R5  Round            A3
2     Arizona  Blue            R5 Square            A9
3     Arizona  Blue            R5  Round            A6
4     Arizona Brown            R6  Round            A3
5     Arizona Brown            R6 Square            A9
6     Arizona Brown            R6  Round            A6
7  California  Blue            R1 Square            A1
8  California  Blue            R1 Square           A10
9  California  Blue            R1 Square            A7
10 California  Blue            R1 Square            A4
11 California  Blue            R1  Round            A5
12 California Brown            R4 Square            A1
13 California Brown            R4 Square           A10
14 California Brown            R4 Square            A7
15 California Brown            R4 Square            A4
16 California Brown            R4  Round            A5
17 California  Blue            R7 Square            A1
18 California  Blue            R7 Square           A10
19 California  Blue            R7 Square            A7
20 California  Blue            R7 Square            A4
21 California  Blue            R7  Round            A5
22     Nevada Brown            R2 Square            A2
23     Nevada Brown            R2  Round            A8
24     Nevada  Blue            R3 Square            A2
25     Nevada  Blue            R3  Round            A8
26     Nevada Brown            R8 Square            A2
27     Nevada Brown            R8  Round            A8

Update I believe you must consider location and respondent so you can try this:

library(dplyr)
#Create keys
DF0 <- rbind(DF1[,c('location','respondent'),drop=F],DF2[,c('location','respondent'),drop=F])
DF0 <- DF0[!duplicated(paste(DF0$location,DF0$respondent)),]

DF0 %>% left_join(DF1) %>% left_join(DF2)

     location respondent color  shape
1  California         R1  Blue   <NA>
2      Nevada         R2 Brown   <NA>
3      Nevada         R3  Blue   <NA>
4  California         R4 Brown   <NA>
5     Arizona         R5  Blue   <NA>
6     Arizona         R6 Brown   <NA>
7  California         R7  Blue   <NA>
8      Nevada         R8 Brown   <NA>
9  California         A1  <NA> Square
10     Nevada         A2  <NA> Square
11    Arizona         A3  <NA>  Round
12 California         A4  <NA> Square
13 California         A5  <NA>  Round
14    Arizona         A6  <NA>  Round
15 California         A7  <NA> Square
16     Nevada         A8  <NA>  Round
17    Arizona         A9  <NA> Square
18 California        A10  <NA> Square

Thanks, but this is exactly what I'm trying to avoid: here values from the original dataframes are duplicated, with three rows for every row in the original dataframes. — KaC, Jul 10 '20 at 15:05
The issue is that `location` has duplicates so that is why you have more rows. You could check `DF3 <- DF3[!duplicated(DF3$location),]` if you need to remove duplicates. — Duck, Jul 10 '20 at 15:10
Yes, but that removes not only the values duplicated during the merge, but all but one value of each individual location. — KaC, Jul 10 '20 at 15:15
@KaC I have included an updated code. Please check and let me know if this works! — Duck, Jul 10 '20 at 15:17
Thank you, @Duck. This is probably the best way to do it. The other option I've been thinking about is `DF3 <- merge(data.frame(DF1, row.names=NULL), data.frame(DF2, row.names=NULL), by = 0, all = TRUE)[-1]` and then combining `location.x` and `location.y` into a single factor, but that might not be ideal for subsequent analysis. — KaC, Jul 10 '20 at 15:20

Match factors in merged dataframes

1 Answers1