In the same spirit as the answer given by @TommyFlyn, there is also the labelled package, maintained by Joseph Larmarange, with notable contributions from Daniel Ludecke (sjlabelled) and Hadley Wickham.
Assume df.ls, a list of dataframes imported using haven::read_dta, which have a large number of variables in common, but not all of them. In that case, it is convenient to use dplyr::bind_rows, which does not require all the dataframes to have the same variables (columns). As mentioned by the OP, the problem is that it `removes' the labels. I will add that this is only true for the variables common to the dataframes of the list: variables that are present in some dataframes but not in others keep their labels.
We can extract the `lost' labels using
    ## get the names of the columns common to all the dfs
    common_col_names <- Reduce(intersect, lapply(df.ls, names))

    library(labelled)

    ## a list of lists: one named list of labels per dataframe
    labs.ls <- lapply(
      df.ls,
      function(x) {
        labelled::var_label(
          x[, common_col_names],
          unlist = FALSE
        )
      }
    )
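To see what the Reduce(intersect, ...) step does in isolation, here is a minimal base R sketch. The column names in name_sets are made up for illustration and stand in for lapply(df.ls, names).

```r
# Hypothetical column names of three dataframes (illustrative only)
name_sets <- list(
  c("id", "x", "y"),
  c("id", "x", "z"),
  c("x", "id")
)

# Reduce applies intersect pairwise, left to right, so only the
# names present in every element survive
common <- Reduce(intersect, name_sets)
print(common)
```

The order of the result follows the first element of the list, which does not matter here since the names are only used to select columns.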
Assume that a given variable keeps the same label across all dataframes of df.ls. Since only the columns in common_col_names are used, this assumption implies that all elements (lists) of labs.ls are identical.
Setting unlist = FALSE (the default) returns the labels as a named list rather than a character vector, which in turn allows the following (from the labelled documentation): ``For data frames, if value is a named list, only elements whose name will match a column of the data frame will be taken into account. If value is a character vector, labels should be in the same order as the columns of the data.frame.'' This is pretty convenient if the columns/variables are not the same across all dataframes of the list.
Note that inspecting labs.ls is useful to check whether the labels actually remain the same across dataframes.
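One possible way to do this check programmatically, rather than by eye, is sketched below in base R. The object labs_example is a hypothetical stand-in for labs.ls, with one label deliberately made different so that the check has something to report.

```r
# Hypothetical stand-in for labs.ls: one named list of labels per dataframe
labs_example <- list(
  list(id = "Identifier", x = "Measurement"),
  list(id = "Identifier", x = "Measurement"),
  list(id = "Identifier", x = "Measure")  # deliberately different label
)

# Is each element identical to the first one?
same_as_first <- vapply(
  labs_example,
  function(l) identical(l, labs_example[[1]]),
  logical(1)
)
print(all(same_as_first))

# Which variables do not have a single label across all dataframes?
label_is_constant <- vapply(
  names(labs_example[[1]]),
  function(v) length(unique(lapply(labs_example, `[[`, v))) == 1L,
  logical(1)
)
differing <- names(label_is_constant)[!label_is_constant]
print(differing)
```

If differing is empty for the real labs.ls, the assumption of constant labels holds and any element of the list can be used when reassigning the labels.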
Then you simply have to bind the dataframes of the list and assign the extracted labels:
    df <- dplyr::bind_rows(df.ls)
    labelled::var_label(df) <- labs.ls[[1]]
Here we use labs.ls[[1]], but since we consider only variables common to all dataframes and assume that the labels of these variables are constant, any other element of labs.ls (index 2, 3, ..., length(df.ls)) could be used instead.
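For completeness, here is a small end-to-end sketch of the whole procedure on toy data. The two dataframes and their labels are invented for illustration (plain data.frames instead of files read with haven::read_dta), and the sketch assumes the labelled and dplyr packages are installed.

```r
library(labelled)

# Toy dataframes standing in for the imported .dta files (hypothetical data)
df1 <- data.frame(id = 1:2, x = c(10, 20))
df2 <- data.frame(id = 3:4, x = c(30, 40), y = c("a", "b"))
var_label(df1) <- list(id = "Identifier", x = "Measurement")
var_label(df2) <- list(id = "Identifier", x = "Measurement", y = "Group")
df.ls <- list(df1, df2)

# Columns present in every dataframe
common_col_names <- Reduce(intersect, lapply(df.ls, names))

# Labels of the common columns, one named list per dataframe
labs.ls <- lapply(df.ls, function(x) var_label(x[common_col_names]))

# Bind the rows, then (re)assign the labels of the common columns;
# the named list only touches columns whose names match
df <- dplyr::bind_rows(df.ls)
var_label(df) <- labs.ls[[1]]

print(var_label(df))
```

Because the labels are passed as a named list, the column y, which is absent from labs.ls[[1]], is left untouched by the assignment.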