0

Q: how do I merge dataframes with unique rows and varying numbers of shared columns?

Background

I have a large number of different datasets of survey respondents that I want to combine into a single dataframe (or tibble, if we're using dplyr). Each respondent is unique; no respondent appears in more than one dataset.

This would be a simple task if each dataset had each variable in the same column order, because the solution would be:

combined <- rbind(df1, df2, df3, dfn)

Sadly, that's not the state of these data. Instead, each dataset has a variable for respondent ID number, year of survey, country of survey, and a varying number of questions that do not appear in the same column order in the raw dataset. They do have the same variable names, however.

That question again, given the background

How do I merge these datasets into a single dataframe, given that a) each respondent row is unique, b) not all variables appear in all datasets, and c) not all variables appear in the same column order?

Some example data

A simple start with just two dataframes (tibbles, actually) is:

df1 <-
    tribble(~ rid,  ~ a,  ~ b,  ~ c,
                1, "a1", "b1", "c1",
                2, "a2", "b2", "c2",
                3, "a3", "b3", "c3")

df2 <-
    tribble(~ rid,  ~ a,  ~ d,  ~ c,  ~ e,
                4, "a4", "d1", "c4", "e1",
                5, "a5", "d2", "c5", "e2",
                6, "a6", "d3", "c6", "e3")

Respondents are identified by rid. Each question is identified by one of the other variables. In this example, questions a and c were asked in both surveys, but b was asked only in the first survey, and d and e were asked only in the second survey.

The goal is to have respondents in the rows and variables in columns, filled in with NAs when a variable does not appear is a dataset.

What I tried

I'm a bit of a goober with base R's merge operations and dplyr's *_join operations. I'm not even sure this is the correct way to go about solving this problem, or if "merging" is even the correct verb. Therefore, perhaps naively, I tried a full_join(df1, df2, by = "rid"), which yields:

# A tibble: 6 x 8
    rid   a.x     b   c.x   a.y     d   c.y     e
  <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1     1    a1    b1    c1  <NA>  <NA>  <NA>  <NA>
2     2    a2    b2    c2  <NA>  <NA>  <NA>  <NA>
3     3    a3    b3    c3  <NA>  <NA>  <NA>  <NA>
4     4  <NA>  <NA>  <NA>    a4    d1    c4    e1
5     5  <NA>  <NA>  <NA>    a5    d2    c5    e2
6     6  <NA>  <NA>  <NA>    a6    d3    c6    e3

Instead, what I'm trying to get is:

    rid     a     b     c     d     e
  <dbl> <chr> <chr> <chr> <chr> <chr>
1     1    a1    b1    c1  <NA>  <NA>
2     2    a2    b2    c2  <NA>  <NA>
3     3    a3    b3    c3  <NA>  <NA>
4     4    a4  <NA>    c4    d1    e1
5     5    a5  <NA>    c5    d2    e2
6     6    a6  <NA>    c6    d3    e3

Another solution that didn't work

I tried the solution in Simultaneously merge multiple data.frames in a list, but it gave me wonky results:

list(df1, df2) %>%
    Reduce(function(dtf1, dtf2) full_join(dtf1, dtf2, by="rid"), .)

Which yields:

# A tibble: 6 x 8
    rid   a.x     b   c.x   a.y     d   c.y     e
  <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1     1    a1    b1    c1  <NA>  <NA>  <NA>  <NA>
2     2    a2    b2    c2  <NA>  <NA>  <NA>  <NA>
3     3    a3    b3    c3  <NA>  <NA>  <NA>  <NA>
4     4  <NA>  <NA>  <NA>    a4    d1    c4    e1
5     5  <NA>  <NA>  <NA>    a5    d2    c5    e2
6     6  <NA>  <NA>  <NA>    a6    d3    c6    e3

I don't know why a and c are not merging appropriately.

Dan
  • 5,209
  • 1
  • 25
  • 37
  • Would you have same "rid" in different dataframes? – zx8754 Jan 14 '18 at 22:06
  • 1
    @zx8754: I had found that question and tried the solution proposed there, but it does not merge the shared columns correctly (see edits to post). That is, `list(df1, df2) %>% Reduce(function(dtf1, dtf2) full_join(dtf1, dtf2, by="rid"), .)` gives me a 6x8 dataframe (not 6x6) whose column names are `rid`, `a.x`, `b`, `c.x`, `a.y`, `d`, `c.y`, and `e`. – Dan Jan 14 '18 at 22:07
  • @zx8754: the raw, raw data probably has duplicate ID numbers between different datasets, but I can uniqueify them easily enough. – Dan Jan 14 '18 at 22:09
  • Maybe simply: `dplyr::bind_rows(df1, df2)` ? – zx8754 Jan 14 '18 at 22:09
  • Please note that the column order itself is not a problem from `rbind` (`rbind(data.frame(a = 1, b = 2), data.frame(b = 3, a = 4))`) - it's when _columns differ_. You may wish to phrase your question accordingly. – Henrik Jan 14 '18 at 22:13
  • @Henrik: the solution in the post you highlighted works, although it was a little convoluted to find. (The short version: `rbindlist(list(df1, df2), fill = TRUE)` gets the job done.) No objections if this post is marked as a duplicate, although the `bind_rows` solution mentioned in the comments also works, is quite simple, and has the added advantage of using only `dplyr` rather than adding `data.table` to the mix. – Dan Jan 14 '18 at 22:17
  • @zx8754: yup, `bind_rows` does the job (didn't know that existed). Please post it as an answer so I can accept it. – Dan Jan 14 '18 at 22:18
  • `bind_rows` is part of the linked post, it is used be called `rbind_all`. – zx8754 Jan 14 '18 at 22:22
  • 1
    @zx8754 I just pinged Arun in the link that he may wish to update his answer to include `bind_rows` instead. – Henrik Jan 14 '18 at 22:30

0 Answers0