Thank you in advance for your help.

I have a baseline dataset of around 30000 individuals. Each individual has a unique ID number. I also have a follow-up dataset with the same people, with maybe 2000 individuals lost to follow-up. I'm trying to merge these datasets, matching the data from both datasets for each ID number. For individuals who have been lost to follow-up, I would like to keep them in the merged dataset, but their row would probably need to contain a bunch of NAs since outcomes couldn't be measured in the follow-up dataset.

Is there a way in R to go about this?

(As a relatively new R user, I don't really know how to even begin approaching this problem. I have a feeling I'd need to use dplyr, but matching individuals from two datasets according to their ID and generating NAs for those who were lost to follow-up are beyond me. Any help or hints would be appreciated.)

awastus
  • FYI, some really good discussions on forming questions to be more _reproducible_: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. Welcome! – r2evans Dec 08 '22 at 17:11

1 Answer

You can use merge() with the argument all.x = TRUE. Put the baseline data first and the follow-up data second in the call, so that every baseline row is kept. For instance, say your baseline data is bl and your follow-up data is fu, and the last 5 patients are missing from follow-up:

bl <- data.frame(id = 1:20,
                var_bl = letters[1:20])

fu <- data.frame(id = 1:15,
               var_fu = letters[1:15])

alldata <- merge(bl, fu, by = "id", all.x = TRUE)

Output:

   id var_bl var_fu
1   1      a      a
2   2      b      b
3   3      c      c
4   4      d      d
5   5      e      e
6   6      f      f
7   7      g      g
8   8      h      h
9   9      i      i
10 10      j      j
11 11      k      k
12 12      l      l
13 13      m      m
14 14      n      n
15 15      o      o
16 16      p   <NA>
17 17      q   <NA>
18 18      r   <NA>
19 19      s   <NA>
20 20      t   <NA>
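
If you also want to see who was lost to follow-up after the merge, one option is to flag them explicitly. This is only a minimal sketch on the toy data above; the column name lost_to_fu is a hypothetical name chosen for illustration, not something merge() creates.

```r
# Toy data from the example above
bl <- data.frame(id = 1:20, var_bl = letters[1:20])
fu <- data.frame(id = 1:15, var_fu = letters[1:15])
alldata <- merge(bl, fu, by = "id", all.x = TRUE)

# Flag by ID membership rather than by is.na(var_fu): a follow-up variable
# could also be NA for someone who did attend follow-up.
# (lost_to_fu is an illustrative column name.)
alldata$lost_to_fu <- !(alldata$id %in% fu$id)

table(alldata$lost_to_fu)        # how many were retained vs. lost
alldata$id[alldata$lost_to_fu]   # IDs of the individuals lost to follow-up
```

Flagging by ID keeps "not seen at follow-up" separate from "seen, but this outcome was not recorded", which may matter in a real dataset.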

Note for future reference that there is also an all.y argument, which would keep all observations in the second dataset (i.e., fu), and an all argument, which would keep all observations in both.
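
To make the difference between these arguments concrete, here is a small sketch. It assumes a hypothetical follow-up record (id 99) with no baseline match, which is not part of the original example and is added only so that the three variants give different results.

```r
bl <- data.frame(id = 1:20, var_bl = letters[1:20])
fu <- data.frame(id = 1:15, var_fu = letters[1:15])

# Hypothetical extra follow-up row whose id does not appear in baseline
fu2 <- rbind(fu, data.frame(id = 99L, var_fu = "z"))

left  <- merge(bl, fu2, by = "id", all.x = TRUE)  # every baseline row is kept
right <- merge(bl, fu2, by = "id", all.y = TRUE)  # every follow-up row is kept
full  <- merge(bl, fu2, by = "id", all = TRUE)    # rows from either dataset are kept

# Compare the row counts of the three results
c(left = nrow(left), right = nrow(right), full = nrow(full))
```

For the situation in the question (keep everyone from baseline, fill NA where follow-up is missing), all.x = TRUE with the baseline data first is the variant that applies.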

Note the comments on your question by r2evans, but for convenience, one dplyr approach would be:

dplyrdata <- dplyr::left_join(bl, fu, by = "id")

This would output the same data as above.
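
If you stay with dplyr, a related sketch (assuming the dplyr package is installed) uses anti_join() to list the baseline individuals with no follow-up record, which can be a handy check on the merge. The object name lost is just an illustrative choice.

```r
library(dplyr)

bl <- data.frame(id = 1:20, var_bl = letters[1:20])
fu <- data.frame(id = 1:15, var_fu = letters[1:15])

# Keep every baseline row; follow-up columns are NA where there is no match
dplyrdata <- left_join(bl, fu, by = "id")

# Baseline rows that have no matching id in the follow-up data
lost <- anti_join(bl, fu, by = "id")

nrow(dplyrdata)   # one row per baseline individual
lost$id           # IDs lost to follow-up
```

Naming the key with by = "id" avoids relying on dplyr guessing the join columns from shared names, which is safer when the two datasets share other variable names.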

jpsmith