-1

I have a dataframe with about 5000 columns and 1000 rows, containing coded data (factors). As it will be too tedious to define every labels (about 20000), is there a way in R to replace the codes by the label text and the variables by the name, previously defined in a library?

1- How to decode each Variable (replace by the variable name text from the library as an excel file)? 2- How to decode data (replace coded data by the label)?

Here examples of Data and library I could have: libray

Data

Thank you!

Gilles Quénot
  • 173,512
  • 41
  • 224
  • 223
  • Please provide enough code so others can better understand or reproduce the problem. – Community Apr 13 '23 at 15:08
  • 1
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Please [do not post code or data in images](https://meta.stackoverflow.com/q/285551/2372064) – MrFlick Apr 13 '23 at 15:53

1 Answers1

0

To start with, I define a toy example of what I assume you would like to achieve:

Given

## Toy Library
library <- data.frame(variable = c("V1", "V1", "V2", "V2", "V2"),
                      name = c("var name 1", "var name 1", "var name 2", "var name 2", "var name 2"),
                      code = c(1,2,1,2,3),
                      label = c("x1", "x2", "A", "B", "C"))

## Toy Data
data <- data.frame(Subj = seq(1,10),
                   V1 = sample(c(1,2), 10, replace = TRUE),
                   V2 = sample(c(1,2,3), 10, replace = TRUE))

you would like to have a new data frame data_recoded:

data_recoded
#    Subj var name 1 var name 2
# 1     1         x1          C
# 2     2         x2          C
# 3     3         x2          B
# 4     4         x2          C
# 5     5         x2          C
# 6     6         x1          B
# 7     7         x1          B
# 8     8         x1          C
# 9     9         x1          C
# 10   10         x1          A

? (Note that this new data frame contains non-syntactic variable names.)

R provides the "factor" data type to conveniently work with coded categorical data. By converting each column in your data frame to a factor, you can "recode" the codes with labels.

First, you need to extract the factor levels and labels for each variable from your library, e.g. with (using the package dplyr):

library(dplyr)

## Create dictionary of factor levels and labels from toy library
dictionary <- library %>%
  group_by(variable) %>%
  summarise(codes = list(code),
            levels = list(label),
            names = unique(name))

Next, you need a function that recodes a single data frame column, which you can later apply on all columns. For example:

## Function to replace codes with labels
recode_variable <- function(var) {
  ## Get column 'var' from data frame
  column <- data[, var]
  ## Get factor levels and labels for variable 'var'
  varlevels <- dictionary %>% filter(variable == var)
  ## Create new data frame from column as factor
  df <- data.frame(name = factor(column,
                                 levels = varlevels$codes[[1]],
                                 labels = varlevels$levels[[1]])
                   )
  ## Replace column name with name from dictionary
  names(df) <- varlevels$names
  ## Return data frame
  return(df)
}

And finally, you can extract the names of the columns you want to recode

## Get original names of columns to be recoded
columnnames <- names(data) %>% setdiff("Subj")

and use the map_dfc() function from the package purrr to apply the function to all selected columns and cbind the resulting data frame columns together:

library(purrr)

## Recode columns
data_recoded <- columnnames %>%
  ## Apply function to all variables in columnames, cbind columns to new data frame
  map_dfc(~ recode_variable(.)) %>%
  ## Add Subject ID back as first column
  mutate(Subj = data$Subj, .before = 1)

data_recoded
#    Subj var name 1 var name 2
# 1     1         x1          C
# 2     2         x2          C
# 3     3         x2          B
# 4     4         x2          C
# 5     5         x2          C
# 6     6         x1          B
# 7     7         x1          B
# 8     8         x1          C
# 9     9         x1          C
# 10   10         x1          A
DrEspresso
  • 211
  • 5