How to merge factors when binding two dataframes together?

Question

Here is a fairly minimal reproducible example. The real dataset is larger and has many factors, so listing the factors manually is not practical. The real code also applies more interesting transformations to the data, for which I want to keep using dplyr.

library(dplyr)
a = data.frame(f=factor(c("a", "b")), g=c("a", "a"))
b = data.frame(f=factor(c("a", "c")), g=c("a", "a"))
a = a %>% group_by(g) %>% mutate(n=1)
b = b %>% group_by(g) %>% mutate(n=2)
rbind(a,b)

This produces:

# A tibble: 4 x 3
# Groups:   g [1]
      f      g     n
  <chr> <fctr> <dbl>
1     a      a     1
2     b      a     1
3     a      a     2
4     c      a     2
Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector

These warnings are annoying, and would actually disappear if I did not use the group_by:

> a = data.frame(f=factor(c("a", "b")), g=c("a", "a"))
> b = data.frame(f=factor(c("a", "c")), g=c("a", "a"))
> a = a %>% mutate(n=1)
> b = b %>% mutate(n=2)
> rbind(a,b)
  f g n
1 a a 1
2 b a 1
3 a a 2
4 c a 2

Explicitly converting to data.frame just before rbind also works:

> rbind(data.frame(a),data.frame(b))
  f g n
1 a a 1
2 b a 1
3 a a 2
4 c a 2

Is there an easy way with base R or dplyr rbind/bind_rows to automatically merge those factors and their levels instead of converting them to character (which makes little sense to me), while still using dplyr for data transformations?

I found https://stackoverflow.com/a/30468468/388803 which proposes a solution to merge the factors manually, but this is very verbose.

My actual use-case is loading two .csv files with read.table, doing some data transformations and then merging the data, as they are complementary. My current workaround is to call data.frame(data) at the end of the data transformations. I wonder why dplyr/tibble does not automatically merge factors, as it seems safe in such a situation. Is this something that could be improved maybe?

To avoid the warnings, perhaps the dataset `factor` column `levels` could be changed before to accommodate the `levels` in the other dataset. Sort of like a `union` — akrun, Oct 22 '17 at 16:15
@akrun Yes, that's one way to do it, as in the linked post, but I don't want to do this manually and the real dataset has many factors and levels. — eregon, Oct 22 '17 at 16:20
Another workaround may be to use `stringsAsFactors = FALSE` and cast those columns you want as `factor` _only after_ binding your input files together — MichaelChirico, Oct 22 '17 at 16:28
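
The suggestion in the last comment could look roughly like the following. This is only a sketch in base R, assuming the inputs can be built (or read) with stringsAsFactors = FALSE; the variable names combined and is_chr are made up for illustration.

```r
# Build the inputs with character columns instead of factors.
a <- data.frame(f = c("a", "b"), g = c("a", "a"), n = 1, stringsAsFactors = FALSE)
b <- data.frame(f = c("a", "c"), g = c("a", "a"), n = 2, stringsAsFactors = FALSE)

# Binding character columns needs no level reconciliation.
combined <- rbind(a, b)

# Convert every character column to a factor only after binding,
# so each factor gets the levels that occur in the combined data.
is_chr <- vapply(combined, is.character, logical(1))
combined[is_chr] <- lapply(combined[is_chr], factor)

levels(combined$f)
```

Here every character column becomes a factor; to convert only some of them, restrict is_chr to the wanted column names.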

score 4 · Answer 1 · answered Jul 18 '18 at 01:41

I came across this question while figuring out a similar task. Using forcats::lvls_union, you can get a character vector of all the levels in a list of factors—in this case, a$f and b$f. Then you can use forcats::fct_expand to expand each data frame's f to have that union of factor levels.

library(tidyverse)

a <- data.frame(f = factor(c("a", "b")), g = c("a")) %>%
  mutate(n = 1) %>%
  group_by(g)

b <- data.frame(f = factor(c("a", "c")), g = c("a")) %>%
  mutate(n = 2) %>%
  group_by(g)

all_lvls <- lvls_union(list(a$f, b$f))

After getting the union of levels, you can mutate both data frames and call bind_rows:

bind_rows(
  a %>% mutate(f = fct_expand(f, all_lvls)),
  b %>% mutate(f = fct_expand(f, all_lvls))
)
#> # A tibble: 4 x 3
#> # Groups:   g [1]
#>   f     g         n
#>   <fct> <fct> <dbl>
#> 1 a     a         1
#> 2 b     a         1
#> 3 a     a         2
#> 4 c     a         2

Or, to get the same result, you can map over a list of the two data frames to expand each f. Using map_dfr is a shorthand, like calling map, then piping into bind_rows.

map_dfr(list(a, b), ~mutate(., f = fct_expand(f, all_lvls)))
#> # A tibble: 4 x 3
#> # Groups:   g [1]
#>   f     g         n
#>   <fct> <fct> <dbl>
#> 1 a     a         1
#> 2 b     a         1
#> 3 a     a         2
#> 4 c     a         2

Created on 2018-07-17 by the reprex package (v0.2.0).
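
Since the question mentions many factor columns, the same union-of-levels idea can be applied to every shared factor column in a loop. The following is a minimal sketch in base R on plain (ungrouped) data frames; unify_factor_levels is a hypothetical helper written for this example, not a function from forcats or dplyr.

```r
# Hypothetical helper: give every factor column shared by x and y
# the union of both level sets, so a later row bind sees equal levels.
unify_factor_levels <- function(x, y) {
  shared <- intersect(names(x), names(y))
  for (col in shared) {
    if (is.factor(x[[col]]) && is.factor(y[[col]])) {
      lv <- union(levels(x[[col]]), levels(y[[col]]))
      x[[col]] <- factor(x[[col]], levels = lv)
      y[[col]] <- factor(y[[col]], levels = lv)
    }
  }
  list(x = x, y = y)
}

a <- data.frame(f = factor(c("a", "b")), g = factor(c("a", "a")), n = 1)
b <- data.frame(f = factor(c("a", "c")), g = factor(c("a", "a")), n = 2)

unified <- unify_factor_levels(a, b)
combined <- rbind(unified$x, unified$y)
levels(combined$f)
```

Columns that are a factor in only one of the inputs are left untouched by this sketch and would still need separate handling.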

score 3 · Answer 2 · answered Oct 22 '17 at 16:17

Solution using data.table.
Convert each data.frame into a data.table with setDT and add n by reference using := (no need for dplyr).

a <- data.frame(f=factor(c("a", "b")), g=c("a", "a"))
b <- data.frame(f=factor(c("a", "c")), g=c("a", "a"))
library(data.table)
rbind(setDT(a)[, n := 1], 
      setDT(b)[, n := 2])
   f g n
1: a a 1
2: b a 1
3: a a 2
4: c a 2
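
As a variant, data.table's rbindlist can create the identifier column itself through its idcol argument. This is a sketch under the assumption that n only needs to record which input each row came from; with an unnamed list the ids are the integer positions 1 and 2.

```r
library(data.table)

a <- data.frame(f = factor(c("a", "b")), g = factor(c("a", "a")))
b <- data.frame(f = factor(c("a", "c")), g = factor(c("a", "a")))

# idcol = "n" adds a column n holding the position of the source table
# in the list; the levels of the factor column f are merged.
combined <- rbindlist(list(a, b), idcol = "n")
levels(combined$f)
```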

answered Oct 22 '17 at 16:17

pogibas


actually there's no need to declare `n` at all -- just use the `idcol` argument: `rbind(a, b, idcol = 'n')`. This appears to be a `dplyr` bug, at core. if we write `a$n = 1; b$n = 1; rbind(a, b)` (i.e., do this in `base`), there's no error. – MichaelChirico Oct 22 '17 at 16:21
Right, that's another workaround. But of course in my real case I have a few transformations with dplyr not trivial to replace like this, and a more realistic/larger dataset. – eregon Oct 22 '17 at 16:22
@eregon i suggest 1) filing a bug with `dplyr` and 2) making your example mimic your use case more, since this answer solves your question as posed – MichaelChirico Oct 22 '17 at 16:26
1) Yeah I wanted to do that first, but they redirect to SO and their mailing list (to which I also posted about the "why" part: https://groups.google.com/forum/#!topic/manipulatr/CxQQMhqOxZg). 2) I added a couple sentences in the question to clarify this is minimal and how it differs with the real dataset/transformations. – eregon Oct 22 '17 at 16:29
Official issue tracker is [here](https://github.com/tidyverse/dplyr/issues) – MichaelChirico Oct 22 '17 at 16:36
Opening an issue shows: "Please briefly describe your problem and what output you expect. If you have a question, please don't use this form, but instead ask on the mailing list or http://stackoverflow.com." – eregon Oct 22 '17 at 17:03
@eregon right, but as explored here, this is a bug you've identified, so you should file as such – MichaelChirico Oct 22 '17 at 17:04
I actually quite like that `dplyr` prints a warning here and coerces it to a character. I would argue that you should only ever `rbind` together two data frames that are _exactly_ the same - types and all. – Moderat Oct 23 '17 at 06:54
@Moderat I beg to disagree :) They have the same type to me in the original .csv, the input is just strings and unifying 2 sets of strings seems perfectly defined. – eregon Oct 30 '17 at 21:19

score 2 · Answer 3 · edited Jul 17 '18 at 15:48

If the factors are just an efficient storage of strings, one could convert them to strings before merging and convert to factor afterwards:

bind_rowsFactors <- function(
  ### bind_rows on two data.frames, merging the levels of their factors
  a      ##<< first data.frame to bind
  , b    ##<< second data.frame to bind
  , ...  ##<< further arguments to \code{bind_rows}
){
  # a column is inconsistent if it is a factor in either input
  # and its levels are not identical in both
  isInconsistentFactor <- sapply( names(a),  function(col){
    (is.factor(a[[col]]) || is.factor(b[[col]])) &&
      !identical(levels(a[[col]]), levels(b[[col]]))
  })
  if (any(isInconsistentFactor)) warning(
    "releveling factors ", paste(names(a)[isInconsistentFactor], collapse = ","))
  for (col in names(a)[isInconsistentFactor]) {
    a <- mutate(ungroup(a), !!col := as.character(!!rlang::sym(col)))
    b <- mutate(ungroup(b), !!col := as.character(!!rlang::sym(col)))
  }
  ans <- bind_rows(a, b, ...)
  # convert former factors from string back to factor
  for (col in names(a)[isInconsistentFactor]) {
    ans <- mutate(ungroup(ans), !!col := factor(!!rlang::sym(col)))
  }
  ##value<< result of \code{bind_rows} with inconsistent factor columns still factors
  ans
}

library(dplyr)
a = data.frame(f = factor(c("a", "b")), g = c("a", "a"))
b = data.frame(f = factor(c("a", "c")), g = c("a", "a"))
a = a %>% group_by(g) %>% mutate(n = 1)
b = b %>% group_by(g) %>% mutate(n = 2)
#bind_rows(a,b)
bind_rowsFactors(a,b)

The strange !!rlang::sym(col) notation is needed because dplyr uses non-standard evaluation: it turns the column name held in the string col into a symbol that mutate can evaluate.

The above code issues a warning on redefining factor levels of f, but otherwise returns the bound data.frame with column f being a factor.

# A tibble: 4 x 3
  f     g         n
  <fct> <fct> <dbl>
1 a     a        1.
2 b     a        1.
3 a     a        2.
4 c     a        2.
Warning message:
In bind_rowsFactors(a, b) : releveling factors f
