Why does `full_join` increase the numbers of rows when there are duplicates?

Question

I noticed full_join and been doubling rows when I am matching on rows with duplicates id's. Is there a way I can include the unique rows from two datasets without duplicating data? I could imagine making another unique identifier.

library(dplyr)

## two row output as expected
x <- tibble(id = c(1,2))
full_join(x, x, by="id")
#> # A tibble: 2 x 1
#>      id
#>   <dbl>
#> 1     1
#> 2     2


## 1's double
y <- tibble(id = c(1,1,2))
full_join(y, y, by="id")
#> # A tibble: 5 x 1
#>      id
#>   <dbl>
#> 1     1
#> 2     1
#> 3     1
#> 4     1
#> 5     2

^{Created on 2020-07-21 by the reprex package (v0.3.0)}

That's how a `full_join` works? Not being sarcastic but that's what I would expect from a full join. My suggestion is to deduplicate both datasets before joining. — CPak, Jul 22 '20 at 03:12
Check out this SQL answer: https://stackoverflow.com/a/27458534 — Ian Campbell, Jul 22 '20 at 03:22
Are you sure you don't want a `union` or `bind_rows` operation rather than a `join`? I'm guessing a bit here but I get the impression you want to stack your data rather than adding columns from one dataset to the other. — thelatemail, Jul 22 '20 at 04:35
quick solution. Add `%>% unique()` at the end of the `full_join`. At least you'll have unique values. — Víctor Cortés, Jul 22 '20 at 04:57
The key issue here is that dplyr does not and should not assume that these are duplicates. The question is also not really clear. Are you querying how `full_join` works or are you simply asking how to remove duplicates from a data frame? In that case just use `distinct`. — Robert Wilson, Jul 22 '20 at 07:31

score 1 · Accepted Answer · answered Jul 22 '20 at 05:10

I think the answer to your question has already been given in the comments. But I will try to illustrate the problem in a little more detail. Maybe it helps :)

First of all: a join tries to match each id-element of dataframe A to a corresponding id-element of dataframe B. If there are duplicate ids in A or B, the join will find produce multiple matches. This is not a special case of a full join.

Here are a few examples (the most common joins I think). I have added an extra column so you can see how the values are joined.

x <- dplyr::tibble(id=c(1,1,2,3,4),
                   val1=c(1,2,3,4,5))
y <- dplyr::tibble(id=c(1,1,3),
                   val2=c("a","b","c"))

full join

dplyr::full_join(x,y,by="id")

# A tibble: 7 x 3
     id  val1 val2 
  <dbl> <dbl> <chr>
1     1     1 a    
2     1     1 b    
3     1     2 a    
4     1     2 b    
5     2     3 NA   
6     3     4 c    
7     4     5 NA

left join

dplyr::left_join(x,y,by="id")

# A tibble: 7 x 3
     id  val1 val2 
  <dbl> <dbl> <chr>
1     1     1 a    
2     1     1 b    
3     1     2 a    
4     1     2 b    
5     2     3 NA   
6     3     4 c    
7     4     5 NA

right join

# A tibble: 5 x 3
     id  val1 val2 
  <dbl> <dbl> <chr>
1     1     1 a    
2     1     2 a    
3     1     1 b    
4     1     2 b    
5     3     4 c

anti join

dplyr::anti_join(x,y,by="id")

    id  val1
  <dbl> <dbl>
1     2     3
2     4     5

Why does `full_join` increase the numbers of rows when there are duplicates?

1 Answers1