Merge two datasets with join - dropping duplicate values in one table

Question

I have two tibbles:

a <- tibble(month=c("Jan", "Feb", "Jan", "Feb"),
   x=c(1,1,2,2))
b <- tibble(x=c(1,2,1,2),
   y=c("a", "b", "c", "d"),
   z=c("m", "n", "m", "n"))

which I want to join. However, I am not interested in the additional information provided by variable y; I know that for any value in x, there is only one value in z. So, the desired outcome is:

# A tibble: 4 x 3
  month     x z    
  <chr> <dbl> <chr>
1 Jan       1 m    
2 Feb       1 m    
3 Jan       2 n    
4 Feb       2 n

But using left_join, every row is duplicated:

> left_join(a, b, by="x")
# A tibble: 8 x 4
  month     x y     z    
  <chr> <dbl> <chr> <chr>
1 Jan       1 a     m    
2 Jan       1 c     m    
3 Feb       1 a     m    
4 Feb       1 c     m    
5 Jan       2 b     n    
6 Jan       2 d     n    
7 Feb       2 b     n    
8 Feb       2 d     n

which is of course understandable, but - in my case - undesired. I tried collapsing the table using group_by(month) %>% summarise(z=z), but this does not work, because summarise can't seem to deal with factors. What would be a solution?

The issue is the duplicate values in 'x', i.e. neither dataset has a unique identifier. So the join cannot tell which 1 in the 'x' of 'a' should match which 1 in the 'x' of 'b'. — akrun, Oct 03 '18 at 17:35
You could do `left_join(a, distinct(b, x, .keep_all = TRUE))` but really the data is far from tidy, so fixing that seems like the right way to go. If you're interested: https://www.jstatsoft.org/article/view/v059i10 Re not wanting y, you can drop it with select before joining, I guess. — Frank, Oct 03 '18 at 17:40
I specifically thought it was tidy - every row is one observation? This might be the fault of my MWE. I found a [solution](https://stats.stackexchange.com/questions/6759/removing-duplicated-rows-data-frame-in-r) to use unique on `b %>% select(-y)`. — Lukas, Oct 03 '18 at 17:45
[@akrun](https://stackoverflow.com/users/3732271/akrun): you're right - however, that's the way my data is organised. Every x appears in every month, and while for every x there are multiple y, there is only one z... — Lukas, Oct 03 '18 at 17:47
Possible duplicate of [Remove duplicated rows using dplyr](https://stackoverflow.com/questions/22959635/remove-duplicated-rows-using-dplyr) — IceCreamToucan, Oct 03 '18 at 18:26
@IceCreamToucan: the solution is the same, but the question is different. — Lukas, Nov 09 '18 at 14:13

score 0 · Accepted Answer · answered Oct 03 '18 at 17:50

The answer is (found here):

library(dplyr)

# Reduce b to its unique (x, z) pairs, then join on x
a %>%
  left_join(b %>%
              select(x, z) %>%
              unique(),
            by = "x")
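
The same idea can also be written with `distinct()`, as suggested in the comments. The following is a minimal sketch using the example data from the question; the name `lookup` and the uniqueness check are additions for illustration only, not part of the original answer. The check guards the assumption that each `x` has exactly one `z`, so the join cannot silently multiply rows.

```r
library(dplyr)

a <- tibble(month = c("Jan", "Feb", "Jan", "Feb"),
            x = c(1, 1, 2, 2))
b <- tibble(x = c(1, 2, 1, 2),
            y = c("a", "b", "c", "d"),
            z = c("m", "n", "m", "n"))

# Keep one row per (x, z) pair; y is dropped because it is not listed.
lookup <- distinct(b, x, z)

# Stop early if some x maps to more than one z (assumption from the question).
stopifnot(anyDuplicated(lookup$x) == 0)

# Each row of a now matches at most one row of lookup.
result <- left_join(a, lookup, by = "x")
result
```

With the example data this should give the four-row table with columns month, x and z that the question asks for. If the check fails on real data, the assumption of one `z` per `x` does not hold and the duplicates need to be resolved before joining.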

answered Oct 03 '18 at 17:50

Lukas
