Q: how do I merge dataframes with unique rows and varying numbers of shared columns?
Background
I have a large number of different datasets of survey respondents
that I want to combine into a single dataframe (or tibble, if
we're using dplyr
). Each respondent is unique; no respondent
appears in more than one dataset.
This would be a simple task if each dataset had each variable in the same column order, because the solution would be:
combined <- rbind(df1, df2, df3, dfn)
Sadly, that's not the state of these data. Instead, each dataset has a variable for respondent ID number, year of survey, country of survey, and a varying number of questions that do not appear in the same column order in the raw dataset. They do have the same variable names, however.
That question again, given the background
How do I merge these datasets into a single dataframe, given that a) each respondent row is unique, b) not all variables appear in all datasets, and c) not all variables appear in the same column order?
Some example data
A simple start with just two dataframes (tibbles, actually) is:
df1 <-
tribble(~ rid, ~ a, ~ b, ~ c,
1, "a1", "b1", "c1",
2, "a2", "b2", "c2",
3, "a3", "b3", "c3")
df2 <-
tribble(~ rid, ~ a, ~ d, ~ c, ~ e,
4, "a4", "d1", "c4", "e1",
5, "a5", "d2", "c5", "e2",
6, "a6", "d3", "c6", "e3")
Respondents are identified by rid
. Each question is identified
by one of the other variables. In this example, questions a
and
c
were asked in both surveys, but b
was asked only in the
first survey, and d
and e
were asked only in the second
survey.
The goal is to have respondents in the rows and variables in
columns, filled in with NA
s when a variable does not appear is a
dataset.
What I tried
I'm a bit of a goober with base R's merge
operations and
dplyr
's *_join
operations. I'm not even sure this is the
correct way to go about solving this problem, or if "merging" is
even the correct verb. Therefore, perhaps naively, I tried a
full_join(df1, df2, by = "rid")
, which yields:
# A tibble: 6 x 8
rid a.x b c.x a.y d c.y e
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 a1 b1 c1 <NA> <NA> <NA> <NA>
2 2 a2 b2 c2 <NA> <NA> <NA> <NA>
3 3 a3 b3 c3 <NA> <NA> <NA> <NA>
4 4 <NA> <NA> <NA> a4 d1 c4 e1
5 5 <NA> <NA> <NA> a5 d2 c5 e2
6 6 <NA> <NA> <NA> a6 d3 c6 e3
Instead, what I'm trying to get is:
rid a b c d e
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 a1 b1 c1 <NA> <NA>
2 2 a2 b2 c2 <NA> <NA>
3 3 a3 b3 c3 <NA> <NA>
4 4 a4 <NA> c4 d1 e1
5 5 a5 <NA> c5 d2 e2
6 6 a6 <NA> c6 d3 e3
Another solution that didn't work
I tried the solution in Simultaneously merge multiple data.frames in a list, but it gave me wonky results:
list(df1, df2) %>%
Reduce(function(dtf1, dtf2) full_join(dtf1, dtf2, by="rid"), .)
Which yields:
# A tibble: 6 x 8
rid a.x b c.x a.y d c.y e
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 a1 b1 c1 <NA> <NA> <NA> <NA>
2 2 a2 b2 c2 <NA> <NA> <NA> <NA>
3 3 a3 b3 c3 <NA> <NA> <NA> <NA>
4 4 <NA> <NA> <NA> a4 d1 c4 e1
5 5 <NA> <NA> <NA> a5 d2 c5 e2
6 6 <NA> <NA> <NA> a6 d3 c6 e3
I don't know why a
and c
are not merging appropriately.