Matching two columns in one data frame against two columns in another, independently of the order

Question

Given two data frames as example:

df1 = data.frame(V1 = c('JOHN', 'BRIAN','KATE', 'ERIC', 'CURT', 'ZACH'), V2 = c('ABIGAIL', 'ANDY', 'GEORGE', 'JOHN', 'MARY', 'FRANKLIN'), V3 = seq(1,6,1), V4 = seq(1,6,1))
df2 = data.frame(V1 = c('ABIGAIL', 'BRIAN','KATE', 'ERIC'), V2 = c('JOHN', 'ANDY', 'MARTIN', 'ANDREW'))

df1
V1       V2 V3 V4
JOHN  ABIGAIL  1  1
BRIAN     ANDY  2  2
KATE   GEORGE  3  3
ERIC     JOHN  4  4
CURT     MARY  5  5
ZACH FRANKLIN  6  6

df2
V1     V2 
ABIGAIL   JOHN
BRIAN   ANDY
KATE MARTIN
ERIC ANDREW

How can I create a new df3 containing the rows that match between columns V1and V2 in df1and df2. The challenge is that I want this match independently of the order.

So that df3would in the example look like:

  df3
  V1   V2 V3 V4
  ABIGAIL JOHN 1 1
  BRIAN ANDY  2  2

I have tried using match() and the %in% operator without any luck.

I am looking for a solution applicable a data frame with several thousand rows.

EDIT

Both answers gave a working solution to my question. It turns out I missed something to make it applicable to my own data. Instead given the data frames:

df1 = data.frame(V1 = c('JOHN', 'BRIAN','KATE', 'ERIC', 'CURT', 'ZACH'), V2 = c('ABIGAIL', 'ANDY', 'GEORGE', 'JOHN', 'MARY', 'FRANKLIN'), V3 = seq(1,6,1), V4 = seq(1,6,1))
df2 = data.frame(V1 = c('ABIGAIL', 'BRIAN','KATE', 'BRIAN', 'ERIC'), V2 = c('JOHN', 'ANDY', 'MARTIN', 'ANDY', 'ANDREW'))

Now I also want to extract the rows even if the match between them are not in the same row. df3would then look like:

df3
V1   V2 V3 V4
ABIGAIL JOHN 1 1
BRIAN ANDY  2  2
BRIAN ANDY  2  2

Basically I want the values in V3 and V4 from df1 for every interactions between the names in V1and V2independently of the order.

@akrun I missed some important info that I have added know. I don't think I can use inner_join in this case. — locoto, Feb 18 '18 at 17:53

score 3 · Accepted Answer · answered Feb 19 '18 at 16:27

Make ID by sorting the columns V1 and V2 rowwise using pmin and pmax, so that "A" "B" and "B" "A" will have same ID as "A_B". Then use merge by ID:

df1$ID <- paste(pmin(as.character(df1$V1), as.character(df1$V2)),
                pmax(as.character(df1$V1), as.character(df1$V2)), sep = "_")

df2$ID <- paste(pmin(as.character(df2$V1), as.character(df2$V2)),
                pmax(as.character(df2$V1), as.character(df2$V2)), sep = "_")

merge(df1, df2[, "ID", drop = FALSE], by = "ID")

#             ID    V1      V2 V3 V4
# 1 ABIGAIL_JOHN  JOHN ABIGAIL  1  1
# 2   ANDY_BRIAN BRIAN    ANDY  2  2
# 3   ANDY_BRIAN BRIAN    ANDY  2  2

IceCreamToucan · Answer 2 · 2018-02-19T03:24:03.600

library(combinat)
df1[apply(df1[,1:2], 1, paste, collapse = ' ') %in%
    apply(df2[,1:2], 1, function(x) sapply(permn(x), paste, collapse = ' '))
    ,]

Using sqldf as below is faster, but doesn't work as easily for > 2 variables since all possibilities must be listed in the join condition.

library(sqldf)
sqldf('
select  df1.*
from    df1
        inner join df2
            on  (df1.V1 = df2.V1 and df1.V2 = df2.V2)
                or (df1.V1 = df2.V2 and df1.V2 = df2.V1)
')

ThatGuy · Answer 3 · 2018-02-19T03:34:58.773

You don't need a package; just test for both ways with:

## Make a frame with combined in both ways
df1o = data.frame(V12 =paste(df1$V1,df1$V2, sep="-"), V21= paste(df1$V2,df1$V1, sep="-"))
## Make a frame with the second combination
df2o = data.frame(P =paste(df2$V1,df2$V2, sep="-") )
## Compare the combinations in both ways and select those that match in a new df
df3 = df1[which(df1o$V12 %in% df2o$P | df1o$V21 %in% df2o$P),]
df3

Results in

> df3
     V1      V2 V3 V4
1  JOHN ABIGAIL  1  1
2 BRIAN    ANDY  2  2Z

Matching two columns in one data frame against two columns in another, independently of the order

EDIT

3 Answers3