Join two dataframes to fill the missing data

Question

I have two different dataframes:

df1 <- tibble(group = c(rep(1, 3), rep(2, 4), rep(1, 3)),
       id = paste0("minutesPrompt", c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3)),
       number = c(rep("a", 3), rep("b", 4), rep("c", 3)),
       minutesPrompt = c(1, 2, 4, 9, 18, 27, 36, 2, 3, 5),
       timestamp = rep("xxxxxx", 10),
       text1 = c("String", rep(NA_character_, 6), rep("String", 3)),
       text2 = c(NA_character_, "String", rep(NA_character_, 5), "String", rep(NA_character_, 2)),
       text3 = c(rep(NA_character_, 2), "String", rep(NA_character_, 7)))

df2 <- tibble(group = rep(2, 7),
              id = paste0("minutesPrompt", c(1, 2, 3, 4, 1, 2, 3)),
              number = c(rep("b", 4), rep("x", 3)),
              minutesPrompt = NA,
              timestamp = rep("xxxxxx", 7),
              text1 = c("String", rep(NA_character_, 6)),
              text2 = c(rep(NA_character_, 2), "String", rep(NA_character_, 4)),
              text3 = c(NA_character_, "String", rep(NA_character_, 5)))

df1 (first picture) which is really big: This dataframe consists of a lot of variables, and includes the values of 3 different groups. It further has 7 rows for each participant expressed by id.
df2 (second picture) contrarily consists only of the variables shown for one group only. The difference between the datasets is also that df1 has some missing values (yellow). The strings that should be transferred into those empty cells are included in df2 (orange).

My plan is to conduct a full join so that I can replace the missing information in df1 on "timestamp", "text1", "text2", until "text7" by the provided values of df2. I have tried this:

full_join(df1, df2) %>%
   group_by("id", "number")

However this does not replace my missing cells (highlighted in yellow) by the strings in df2.

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. We don't need all of your real data, just something small would be helpful to see what's going on. — MrFlick, Jul 23 '20 at 21:29
You do not specify which fields to join on. By default, `dplyr::full_join` will join on any common columns names between both sets, not necessarily `id` and `number`. — Parfait, Jul 23 '20 at 21:37
Thank you Parfait. R took all my variables of the df2 as keys I guess. But how can I specify the columns? — Nadine M., Jul 23 '20 at 21:49
@NadineM. In this case you can just add the argument `by = character vector of column names`, so in this case `full_join(df1,df2, by = c("id", "number"))` will join on id and number — NotThatKindODr, Jul 23 '20 at 23:13
Please in code questions give a [mre]--cut & paste & runnable code, including smallest representative example input as code; desired & actual output (including verbatim error messages); tags & versions; clear specification & explanation. Give the least code you can that is code that you show is OK extended by code that you show is not OK. (Debugging fundamental.) For R/SQL that includes DBMS & DDL (including constraints & indexes) & input as code formatted as a table. [ask] Pause work on the overall goal, chop code to the 1st expression not giving what you expect & say what you expect & why. — philipxy, Jul 23 '20 at 23:34
Also read https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right . Most probably the answer you are looking for is already there. — Ronak Shah, Jul 24 '20 at 01:08
Please [use text, not images/links, for text--including tables & ERDs](https://meta.stackoverflow.com/q/285551/3404097). Use images only for what cannot be expressed as text or to augment text. Include a legend/key & explanation with an image. — philipxy, Jul 24 '20 at 20:04
@NotThatKindODr : Your code also works well, thank you. The thing is only that it extended my data set because it gives me all the variables in df1 with .x (e.g. id.x, number.x etc. and then all the variables again from df2 with .y (e.g. id.y, number.y) so that you have then to find a solution how to integrate the .x an .y columns for each of the variables — Nadine M., Jul 25 '20 at 08:10

M-- · Accepted Answer · 2020-07-24T19:53:04.657

We can use data.table package:

library(data.table)

setDT(df1)[setDT(df2), `:=` ( timestamp  = i.timestamp,
                              text1 = i.text1,
                              text2 = i.text2,
                              text3 = i.text3 ), 
            on = .(id, number)][] ## may wanna add `group` column to `on` arguments

#>    group             id number minutesPrompt timestamp  text1  text2  text3
#> 1:     1 minutesPrompt1      a             1    xxxxxx String   <NA>   <NA>
#> 2:     1 minutesPrompt2      a             2    xxxxxx   <NA> String   <NA>
#> 3:     1 minutesPrompt3      a             4    xxxxxx   <NA>   <NA> String
#> 4:     2 minutesPrompt1      b             9    xxxxxx String   <NA>   <NA>
#> 5:     2 minutesPrompt2      b            18    xxxxxx   <NA>   <NA> String
#> 6:     2 minutesPrompt3      b            27    xxxxxx   <NA> String   <NA>
#> 7:     2 minutesPrompt4      b            36    xxxxxx   <NA>   <NA>   <NA>
#> 8:     1 minutesPrompt1      c             2    xxxxxx String String   <NA>
#> 9:     1 minutesPrompt2      c             3    xxxxxx String   <NA>   <NA>
#> 10:    1 minutesPrompt3      c             5    xxxxxx String   <NA>   <NA>

thank you so much, your code works really well. And I did not have to add the 'group' column :-). — Nadine M., Jul 25 '20 at 08:05

Join two dataframes to fill the missing data

1 Answers1