How to specify names of columns for x and y when joining in dplyr?

Question

I have two data frames that I want to join using dplyr. One is a data frame containing first names.

test_data <- data.frame(first_name = c("john", "bill", "madison", "abby", "zzz"),
                        stringsAsFactors = FALSE)

The other data frame contains a cleaned up version of the Kantrowitz names corpus, identifying gender. Here is a minimal example:

kantrowitz <- structure(list(name = c("john", "bill", "madison", "abby", "thomas"), gender = c("M", "either", "M", "either", "M")), .Names = c("name", "gender"), row.names = c(NA, 5L), class = c("tbl_df", "tbl", "data.frame"))

I essentially want to look up the gender of the name from the test_data table using the kantrowitz table. Because I'm going to abstract this into a function encode_gender, I won't know the name of the column in the data set that's going to be used, and so I can't guarantee that it will be name, as in kantrowitz$name.

In base R I would perform the merge this way:

merge(test_data, kantrowitz, by.x = "first_names", by.y = "name", all.x = TRUE)

That returns the correct output:

  first_name gender
1       abby either
2       bill either
3       john      M
4    madison      M
5        zzz   <NA>

But I want to do this in dplyr because I'm using that package for all my other data manipulation. The dplyr by option to the various *_join functions only lets me specify one column name, but I need to specify two. I'm looking for something like this:

library(dplyr)
# either
left_join(test_data, kantrowitz, by.x = "first_name", by.y = "name")
# or
left_join(test_data, kantrowitz, by = c("first_name", "name"))

What is the way to perform this kind of join using dplyr?

(Never mind that the Kantrowitz corpus is a bad way to identify gender. I'm working on a better implementation, but I want to get this working first.)

You can't currently, but it's on the to do list: https://github.com/hadley/dplyr/issues/177 — hadley, Feb 19 '14 at 20:03

score 212 · Accepted Answer · answered Oct 17 '14 at 19:44

212

This feature has been added in dplyr v0.3. You can now pass a named character vector to the by argument in left_join (and other joining functions) to specify which columns to join on in each data frame. With the example given in the original question, the code would be:

left_join(test_data, kantrowitz, by = c("first_name" = "name"))

answered Oct 17 '14 at 19:44

Lincoln Mullen

6,257
4
27
30

17

**edit** This works in the general case as well : `left_join(data_a, data_b, by = c("a.first" = "b.first", "a.second" = "b.second", "a.third" = "b.third"))` ? – davidski Feb 25 '16 at 14:35
1

The `by =` is optional. You can do `left_join(test_data, kantrowitz, c("first_name" = "name"))` – Pranay Aryal Apr 04 '17 at 15:03
16

That's true of any argument to a function. But I generally find it better to be explicit by using named arguments rather than position matching in this case. – Lincoln Mullen Apr 04 '17 at 18:28
1

Some users may suggest updating the syntax to use join_by(). The following is the equivalent: left_join(test_data, kantrowitz, by = join_by(first_name == name)) – etrowbridge Feb 09 '23 at 08:43

Sven Hohenstein · Answer 2 · 2014-07-15T05:54:59.630

5

This is more a workaround than a real solution. You can create a new object test_data with another column name:

left_join("names<-"(test_data, "name"), kantrowitz, by = "name")

     name gender
1    john      M
2    bill either
3 madison      M
4    abby either
5     zzz   <NA>

edited Jul 15 '14 at 05:54

answered Feb 19 '14 at 18:22

Sven Hohenstein

80,497
17
145
168

Renaming induces a copy, I think, which may be way dplyr avoids it and makes you do it instead. – joran Feb 19 '14 at 18:35
2

In 0.1.2 you'll at least be able to do `select(test_data, first_name = name)` and that will only make a shallow copy. – hadley Feb 19 '14 at 20:04
1

Use `data.table::setnames`? – Hugh Apr 10 '14 at 22:14
2

the solution select(test_data, first_name = name) does not work as of June 2014 – userJT Jun 20 '14 at 14:18

How to specify names of columns for x and y when joining in dplyr?

2 Answers2

Linked

Related