154

I realize that dplyr v3.0 allows you to join on different variables:

left_join(x, y, by = c("a" = "b") will match x.a to y.b

However, is it possible to join on a combination of variables or do I have to add a composite key beforehand?

Something like this:

left_join(x, y, by = c("a c" = "b d") to match the concatenation of [x.a and x.c] to [y.b and y.d]

Mus
  • 7,290
  • 24
  • 86
  • 130
JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116

2 Answers2

279

Updating to use tibble()

You can pass a named vector of length greater than 1 to the by argument of left_join():

library(tidyverse)

set.seed(0)

d1 <- tibble(
  x = letters[1:3],
  y = LETTERS[1:3],
  a = rnorm(3)
  )

d2 <- tibble(
  x2 = letters[3:1],
  y2 = LETTERS[3:1],
  b = rnorm(3)
  )

left_join(d1, d2, by = c("x" = "x2", "y" = "y2"))
Mark
  • 7,785
  • 2
  • 14
  • 34
davechilders
  • 8,693
  • 2
  • 18
  • 18
  • 8
    Thanks for this; also works when the columns in the data frames have the same name, e.g. `left_join(d1, d2, by = c("firstname" = "firstname", "lastname" = "lastname"))`. May not be obvious to some. – Anthony Simon Mielniczuk Jan 27 '18 at 14:41
  • 31
    When the join columns are the same, you can also avoid the `=`: `left_join(d1, d2, by = c("firstname", "lastname"))` – davechilders Jan 27 '18 at 19:06
  • 5
    Ooof... I was holding out home, but... this appears to be an AND... which I suppose makes sense but I was hoping it'd be an x=x2 OR y=y2, as I have multiple indexes built to try to identify duplicate but damaged entries across disparate resources. – Joshua Eric Turcotte Aug 21 '18 at 18:48
  • The names don't have to be the same, the should just be valid column names in the corresponding dataframe i.e one can have a column "fname" and the other "firstname" and will work just fine. – San Emmanuel James Jul 12 '20 at 02:10
2

As of May 2022, we now also have the option of using join_by(), which, in addition to allowing joining by specific columns, like in Dave's answer, allows a variety of other ways of joining two dataframes.

We can use:

  • Equality condition: ==
  • Inequality conditions: >=, >, <=, or <
  • Rolling helper: closest()
  • Overlap helpers: between(), within(), or overlaps()

Examples:

# first we create a dataset similar to Dave's one, but with a few more columns, which make it easy to demonstrate the other joins
library(tidyverse)
set.seed(0)

dfx <- tibble(
  id = 1:3,
  first_name = c("Alice", "Bob", "Charlie"),
  last_name = c("Adams", "Baker", "Chaplin"),
  a = rnorm(3),
  lb = 0.25,
  ub = 0.75)

dfy <- tibble(
  id = 1:3,
  first_name = c("Alice", "Bob", "Charlie"),
  last_name = c("Adams", "Baker", "Chaplin"),
  b = rnorm(3),
  other_range = 0,
  other_range2 = 1)

Equality join (what OP asked for):

left_join(dfx, dfy, join_by(id, first_name, last_name == last_name))

Note: if the names of the columns you want to join by are the same in both dataframes, you don't need to do col == col, you can just use col, as in the first two columns in the above example.

Inequality join:

left_join(dfx, dfy, join_by(a < b)) # join the rows where a < b

Rolling joins:

left_join(dfx, dfy, join_by(closest(a < b))) # similar to above, but only take the closest match

Overlap joins:

left_join(dfx, dfy, join_by(between(a, other_range, other_range2))) # join rows where a is between other_range and other_range2

left_join(dfx, dfy, join_by(overlaps(lb, ub, other_range, other_range2))) # join rows where the the ranges (lb to ub, and other_range to other_range2) overlap

left_join(dfx, dfy, join_by(within(lb, ub, other_range, other_range2))) # join rows where lb to ub is within other_range to other_range2

Another note: join_by() assumes you'll list columns for the left column before the right one. If you don't want to do that for some reason, use x$ for the left dataframe, and y$ for the right dataframe, e.g. join_by(x$a < y$b).

For more information, read the documentation.

Update:

I realised I never actually addressed the nub of OP's question:

Something like this: left_join(x, y, by = c("a c" = "b d")

You can't do that specifically, because dplyr expects every string to be the name of a column. However, if you have two strings which contain columns separated by spaces, you could do something like this:

j1 <- "id first_name last_name"
j2 <- j1 # let's pretend for the sake of argument they are different, as it doesn't change the answer

join_vec <- function(j1, j2) {
    setNames(str_split(j2, " ")[[1]], str_split(j1, " ")[[1]])
}

left_join(dfx, dfy, by = join_vec(j1, j2))
Mark
  • 7,785
  • 2
  • 14
  • 34