0

I am trying to merge two data frames by group id. However, both data frames are not of the same length and some elements of certain groups are missing in the second data frame. In the merged file, the missing elements of a certain group should be NAs.

The data looks something like this

df1 <- data.frame(id = c(1,1,1,2,3,3,4), x = c("a", "b", "c", "d", "e", "f", "g"))
df2 <- data.frame(id = c(1,1,2,3,4), y = c("A", "B", "D", "E", "G"))

Ideally, the result would look like this:

id  x    y
1   a    A
1   b    B
1   c <NA>
2   d    D
3   e    E
3   f <NA>
4   g    G

It would be great if the code worked for additional columns that also correspond to the same group ids but may miss elements at different places.

I have tried full_join and merge so far but without success, as they just repreat the y values instead of introducing na's.

I know there are similar questions out there, but I have found none that solves this problem. Any help is appreciated.

Claudio
  • 10,614
  • 4
  • 31
  • 71
Jonas C
  • 5
  • 4
  • When joining on `id`, how would R get to know that one time id = 1 --> y = A, and the other time id = 1 --> Y = B... Is there some relation between x and y? – Wimpel May 17 '19 at 08:21
  • Not explicity, that is the problem. This is the result of some manipulations I did beforehand and technically all that links the two is how they are sorted. So if group 1 consists of x = a,b,c, then any non-missing values in y will necessarily be in the same order. Only the last n values of a certain group can me missing. – Jonas C May 17 '19 at 08:27
  • not a duplicate, because some data-preparation is needed befor joining. The actual join is not the problem/question here. – Wimpel May 17 '19 at 08:33
  • Maybe you can explain the process better. All I'm seeing is that you're matching lowercase letters in x with their uppercase versions in y, if those uppercase versions are present. Is there more to it? Or could you just add a lowercase version to y, and left-join on ID and lowercase? – camille May 17 '19 at 13:41

1 Answers1

1

This data.table solution might work..

first, create row_id's per group. The join by id on these row id's.

library(data.table)
dt1 <- data.table(id = c(1,1,1,2,3,3,4), x = c("a", "b", "c", "d", "e", "f", "g"))
dt2 <- data.table(id = c(1,1,2,3,4), y = c("A", "B", "D", "E", "G"))

#rumber rows by group
dt1[ , row_id := seq.int(1:.N), by = .(id)]
dt2[ , row_id := seq.int(1:.N), by = .(id)]

dt1[dt2, y := i.y, on = .(id, row_id)][, row_id := NULL][]

#    id x    y
# 1:  1 a    A
# 2:  1 b    B
# 3:  1 c <NA>
# 4:  2 d    D
# 5:  3 e    E
# 6:  3 f <NA>
# 7:  4 g    G
Wimpel
  • 26,031
  • 1
  • 20
  • 37