dplyr join by exclusion?

Question

When using the various join functions from dplyr you can either join all variables with the same name (by default) or specify those ones using by = c("a" = "b"). Is there a way to join by exclusion? For example, I have 1000 variables in two data frames and I want to join them by 999 of them, leaving one out. I don't want to do by = c("a1" = "b1", ...,"a999" = "b999"). Is there a way to join by excluding the one variable that is not used?

Ok, using this example from one answer:

set.seed(24)
df1 <- data_frame(alala= LETTERS[1:3], skks= letters[1:3], sskjs= 
                  letters[1:3], val = rnorm(3))
df2 <- data_frame(alala= LETTERS[1:3], skks= letters[1:3], sskjs= 
                   letters[1:3], val = rnorm(3))

I want to join them using all variables excluding val. I'm looking for a more general solution. Assuming there are 1000 variables and I only remember the name of the one that I want to exclude in the join, while not knowing the index of that variable. How can I perform the join while only knowing the variable names to exclude. I understand I can find the column index first but is there a simply way to add exclusions in by =?

Although your question seems clear, please share a minimal reproducible example that mimics your problem. — markus, Oct 17 '18 at 18:38
It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Oct 17 '18 at 18:38
@JianghuiDu Note that if you do not know the index of the column to exclude, and the columns to match are not named the same in both data frames, how do you know that you will correctly pair the columns if there is no pattern to the names? — Calum You, Oct 17 '18 at 18:59
sorry, my fault. these names should be the same. Corrected in the example. — JianghuiDu, Oct 17 '18 at 19:06

akrun · Answer 1 · 2018-10-17T18:51:15.550

3

We create a named vector to do this

library(dplyr)
grps <- setNames(paste0("b", 1:999), paste0("a", 1:999))

Note the 'grps' vector is created with paste as the OP's post suggested a pattern. If there is no pattern, but we know the column that is not to be grouped

nogroupColumn <- "someColumn"
grps <- setNames(setdiff(names(df1), nogroupColumn), 
                   setdiff(names(df2), nogroupColumn))

inner_join(df1, df2, by = grps)

Using a reproducible example

set.seed(24)
df1 <- data_frame(a1 = LETTERS[1:3], a2 = letters[1:3], val = rnorm(3))
df2 <- data_frame(b1 = LETTERS[3:4], b2 = letters[3:4], valn = rnorm(2))
grps <- setNames(paste0("b", 1:2), paste0("a", 1:2))

inner_join(df1, df2, by = grps)
# A tibble: 1 x 4
#  a1    a2      val   valn
#  <chr> <chr> <dbl>  <dbl>
#1 C     c     0.420 -0.584

edited Oct 17 '18 at 18:51

answered Oct 17 '18 at 18:40

akrun

874,273
37
540
662

OK. The question I'm thinking is not this simple. In this example the variable names are simply and you can do this, but what about the variable names has no "rules" like in his simple example? I guess my point is that I know the name that I do not want to join, while these ones that I want to join are too many and I can't remember all their names. – JianghuiDu Oct 17 '18 at 18:44
2

@JianghuiDu In your post, there seems to be a pattern. If you know the index of columns then, `grps <- setNames(names(df1)[-1], names(df2)[-1])` – akrun Oct 17 '18 at 18:46
Assuming I only know the name but not the index. – JianghuiDu Oct 17 '18 at 18:50
1

Super clever. I see the new question reads: "Assuming there are 1000 variables and I only remember the name of the one that I want to exclude in the join, while not knowing the index of that variable." – Nettle Oct 18 '18 at 03:15

Nettle · Answer 2 · 2018-10-18T03:18:18.367

To exclude a certain field(s), you need to identify the index of the columns you want. Here's one way:

which(!names(df1) %in% "sskjs" ) #<this excludes the column "sskjs"
[1] 1 2 4                        #<and shows only the desired index columns

Use unite to create a join_id in each dataframe, and join by it.

df1 <- df1 %>% 
    unite(join_id, which(!names(.) %in% "sskjs"), remove = F)

df2 <- df2 %>% 
    unite(join_id, which(!names(.) %in% "sskjs"), remove = F)

left_join(df1, df2, by = "join_id" )

dplyr join by exclusion?

2 Answers2