6

haven::read_dta supports importing variable labels from Stata into R by storing them in each column's label attribute. RStudio also displays these labels in the View pane.

However, when two data frames are bound using dplyr::bind_rows (or rbind_all), the labels are not preserved. Is this a bug?

library(dplyr)
id <- 1:5
attr(id, "label") <- "unit id"

df1 <- tbl_df(data.frame(id)) # label is fine
df1$id
# [1] 1 2 3 4 5
# attr(,"label")
# [1] "unit id"

df2 <- tbl_df(data.frame(id)) # label is fine
df2$id
# [1] 1 2 3 4 5
# attr(,"label")
# [1] "unit id"

df_bound <- bind_rows(df1, df2) # label is gone
df_bound$id
# [1] 1 2 3 4 5 1 2 3 4 5
Heisenberg
  • 1
    I edited because it wasn't clear about what you were talking about. –  Jan 20 '16 at 01:53
  • 2
    Interesting question. [This dplyr blog post](http://blog.rstudio.org/2015/09/04/dplyr-0-4-3/) says "All functions should now copy column attributes from the input to the output..." – tospig Jan 20 '16 at 01:57
  • 2
    Just in case, I tried with the devel version of `dplyr`, i.e. `0.4.3.9000`, but it doesn't work either. –  Jan 20 '16 at 02:01

4 Answers

2

A workaround is to use rbind instead of bind_rows. Unlike bind_rows, rbind requires both data frames to have the same column names, so check that first.

Use setdiff(names(df1), names(df2)) to list the columns that are in df1 but not in df2, and setdiff(names(df2), names(df1)) for the reverse.
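As a minimal sketch of this workaround, the question's example can be reduced to base R (plain data.frame instead of tbl_df, which is an assumption made here to keep the snippet free of package dependencies). The names only_in_df1 and only_in_df2 are illustrative.

```r
# Rebuild the question's toy data with a labelled column
id <- 1:5
attr(id, "label") <- "unit id"
df1 <- data.frame(id)
df2 <- data.frame(id)

# Columns present in one data frame but not the other;
# both results should be empty before calling rbind
only_in_df1 <- setdiff(names(df1), names(df2))
only_in_df2 <- setdiff(names(df2), names(df1))
stopifnot(length(only_in_df1) == 0, length(only_in_df2) == 0)

# rbind builds each result column from the first data frame's column,
# so the label attribute is expected to survive
df_bound <- rbind(df1, df2)
attr(df_bound$id, "label")
```

If either setdiff result is non-empty, the missing columns would have to be added (for example, filled with NA) before rbind will accept the two data frames.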

Lewistrick
1

The sjlabelled package by Daniel Lüdecke is a nice solution for problems like this when working with labelled data. I used its copy_labels function for a similar issue:

library(dplyr)
library(sjlabelled)
id <- 1:5
attr(id, "label") <- "unit id"
df1 <- tbl_df(data.frame(id))
str(df1)
# tibble [5 × 1] (S3: tbl_df/tbl/data.frame)
#  $ id: int [1:5] 1 2 3 4 5
#   ..- attr(*, "label")= chr "unit id"
df2 <- tbl_df(data.frame(id)) # label is fine
df_bound <- bind_rows(df1, df2) # label is gone
str(df_bound)
# tibble [10 × 1] (S3: tbl_df/tbl/data.frame)
#  $ id: int [1:10] 1 2 3 4 5 1 2 3 4 5

df_bound <- copy_labels(df_bound, df1) # copy the labels back from df1
df_bound_labelled <- df_bound %>% mutate_at(vars(id), as_labelled)
str(df_bound_labelled)
# tibble [10 × 1] (S3: tbl_df/tbl/data.frame)
#  $ id: int [1:10] 1 2 3 4 5 1 2 3 4 5
#   ..- attr(*, "label")= chr "unit id"
TommyFlynn
0

sjmisc::add_rows has a syntax similar to dplyr::bind_rows, and it preserves variable and value label attributes.

Paul Roub
hezht3
0

In the same spirit as the answer given by @TommyFlynn, there is also the labelled package, maintained by Joseph Larmarange, whose authors notably include Daniel Lüdecke (sjlabelled) and Hadley Wickham.

Assume df.ls is a list of data frames imported using haven::read_dta that share a large number of variables, but not all of them. In that case, it is convenient to use dplyr::bind_rows, which does not require every data frame to have the same variables (columns). As mentioned by the OP, the problem is that it 'removes' the labels. I will add that this is only true for the variables common to the data frames of the list. Variables that are present in some data frames but not in others keep their labels.

We can extract the 'lost' labels using

library(labelled)

## names of the columns common to all the data frames
common_col_names <- Reduce(intersect, lapply(df.ls, names))

## a list of lists: one named list of variable labels per data frame
labs.ls <- lapply(
    df.ls,
    function(x) {
        labelled::var_label(
            x[, common_col_names],
            unlist = FALSE
        )
    }
)

Assume that a given variable keeps the same label across all dataframes of df.ls. The use of common_col_names, along with the assumption that labels are constant, thus implies that all elements (lists) of labs.ls are the same.

Setting unlist = FALSE (the default) returns the labels as a list rather than a character vector, which in turn allows the following (from the labelled documentation): "For data frames, if value is a named list, only elements whose name will match a column of the data frame will be taken into account. If value is a character vector, labels should be in the same order as the columns of the data.frame." This is pretty convenient if the columns/variables are not the same across all data frames of the list.

Note that inspecting labs.ls is useful to check whether the labels actually remain the same across dataframes.
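One way to do that inspection programmatically is sketched below in base R. The helpers get_labels and make_df and the toy data are hypothetical, introduced only so the snippet runs without haven or labelled; get_labels stands in for labelled::var_label(x, unlist = FALSE).

```r
# Hypothetical stand-in for labelled::var_label(x, unlist = FALSE):
# returns a named list with one label (or NULL) per column
get_labels <- function(df) lapply(df, attr, which = "label", exact = TRUE)

# Toy data: three data frames whose 'id' column carries a label
make_df <- function(label) {
  id <- 1:3
  attr(id, "label") <- label
  data.frame(id)
}
df.ls <- list(make_df("unit id"), make_df("unit id"), make_df("unit ID"))

labs.ls <- lapply(df.ls, get_labels)

# Compare every element of labs.ls with the first one
same_as_first <- vapply(labs.ls, identical, logical(1), labs.ls[[1]])

# Positions of the data frames whose labels differ from the first
which(!same_as_first)
```

If which() returns anything, the assumption that labels are constant across data frames does not hold, and the choice of which element of labs.ls to reuse starts to matter.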

Then you simply bind the data frames from the list and assign the extracted labels to the result:

df <- dplyr::bind_rows(df.ls)
labelled::var_label(df) <- labs.ls[[1]]

Here we use labs.ls[[1]], but since we consider only variables common to all data frames and assume that the labels of these variables are constant, any other element of labs.ls (index 2, 3, ..., length(df.ls)) could be used instead.
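For readers who want to see what the assignment step amounts to, here is a rough base-R approximation. It is not the labelled implementation; the data frame, the labs list and the extra income entry are made up for illustration. It mimics the documented behaviour that only list elements whose names match a column are used.

```r
# A bound data frame whose labels were lost (toy data, not from the question)
df <- data.frame(id = 1:3, age = c(30L, 41L, 25L))

# Named list of labels; 'income' has no matching column and is skipped
labs <- list(id = "unit id", age = "age in years", income = "not in df")

# Assign each label to the column with the same name
for (nm in intersect(names(labs), names(df))) {
  attr(df[[nm]], "label") <- labs[[nm]]
}

str(df)
```

Because the matching is done by name, the order of the elements in labs does not matter, which is the same convenience the named-list form of var_label provides.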