5

I have a dataframe with questionnaire response labels. I always like to make a tibble with item-answer definitions and then use dplyr::recode() to replace all item labels with their corresponding definitions. For ease of use the definitions tibble recode_df stores these correspondences as strings and within dplyr::recode() they can be unpacked with bangbangbang !!! and evaluated. In the following toy example there are 4 items, two for qa and two for qb that share the same answer definitions.

library(tidyverse)
set.seed(42)

# columns starting with `qa` and `qb` share the same answer structure 
data_df <- tibble(
  qa_1 = sample(c(0, 1), 5, replace = TRUE),
  qa_2 = sample(c(0, 1), 5, replace = TRUE),
  qb_1 = sample(1:5, 5, replace = TRUE),
  qb_3 = sample(1:5, 5, replace = TRUE)
)

# `answer` column stores string definitions for use with `dplyr::recode()`
recode_df <- tibble(
  question = c("qa", "qb"),
  answer = c(
    'c("0" = "foo0", "1" = "foo1")',
    'c("1" = "bar1", "2" = "bar2", "3" = "bar3", "4" = "bar5", "5" = "bar5")'
  )
)  

# Desired result
data_df %>%
  mutate(
    across(
      .cols = starts_with("qa"),
      .fns = ~recode(., !!!eval(parse(text = recode_df$answer[str_detect(recode_df$question, "qa")])))
    ),
    across(
      .cols = starts_with("qb"),
      .fns = ~recode(., !!!eval(parse(text = recode_df$answer[str_detect(recode_df$question, "qb")])))
    )
  )
#> # A tibble: 5 x 4
#>   qa_1  qa_2  qb_1  qb_3 
#>   <chr> <chr> <chr> <chr>
#> 1 foo0  foo1  bar5  bar2 
#> 2 foo0  foo1  bar1  bar3 
#> 3 foo0  foo1  bar5  bar1 
#> 4 foo0  foo0  bar5  bar1 
#> 5 foo1  foo1  bar2  bar3

Created on 2023-02-26 with reprex v2.0.2

I can reach my desired result by using one mutate() and across for each row of recode_df, but I am sure there is an elegant purrr solution that iterates and recodes without repeating code. Thank you.

Claudiu Papasteri
  • 2,469
  • 1
  • 17
  • 30

2 Answers2

2

There are a number of alternatives to consider, especially if storing your answer key in a different form. However, given the present data.frames, you could try the following. Using map_dfc to column-bind your end result. You can apply your recoding function to each element of a vector of character values, such as "qa" and "qb". Let me know if this helps.

library(tidyverse)

map_dfc(
  recode_df$question,
  \(x) {
    map(
      select(data_df, contains(x)),
      \(y) recode(y, !!!eval(parse(text = recode_df$answer[str_detect(recode_df$question, x)])))
    )
  }
)

Output

  qa_1  qa_2  qb_1  qb_3 
  <chr> <chr> <chr> <chr>
1 foo0  foo1  bar5  bar2 
2 foo0  foo1  bar1  bar3 
3 foo0  foo1  bar5  bar1 
4 foo0  foo0  bar5  bar1 
5 foo1  foo1  bar2  bar3 
Ben
  • 28,684
  • 5
  • 23
  • 45
  • Thank you for the answer, it seems to work very well, although any other columns in the dataframe is dropped and needs to be cbinded back to it after. I am very interested in your suggestion regarding storing answer key in a different form. What would you say would ease this workflow? I am aware the question is off topic, but If you could give me any pointers, maybe I could post another, better question. Thank you again. – Claudiu Papasteri Feb 26 '23 at 22:25
  • 1
    @ClaudiuPapasteri It's a good question - my personal preference is to have your question/data data.frame in long (not wide) form, with only a few columns such as "question_type" (e.g., qa or qb that share a common recode), "item_number" (counter, as in 1, 2, 3...), and "item_response"...then, in your answer key data.frame, you want "question_type", "item_response", and "recoded_response"...then, just do a join (by both question_type and item_response) to recode your responses...I also tend to keep "both" raw and recoded responses for quality checking... – Ben Feb 27 '23 at 02:12
1

You can have that cheaper.

data_df[] <- lapply(names(data_df), \(x) if (grepl('qa', x)) paste0('foo', data_df[[x]]) else paste0('bar', data_df[[x]]))

If there are much more columns, you can use a simple dictionary dc consisting of a named vector with data prefixes as elements and column prefixes as names.

dc <- c(qa='foo', qb='bar')
## alternatively using `grep` to identify columns
# dc <- setNames(c('foo', 'bar'), unique(gsub('_\\d+$', '', names(data_df))))

We can now feed startsWith with names column name and dc-name to identify the correct entry in dc.

data_df[] <- lapply(names(data_df), \(x) paste0(dc[startsWith(x, names(dc))], data_df[[x]]))

data_df
#   qa_1 qa_2 qb_1 qb_3
# 1 foo0 foo1 bar4 bar2
# 2 foo0 foo1 bar1 bar3
# 3 foo0 foo1 bar5 bar1
# 4 foo0 foo0 bar4 bar1
# 5 foo1 foo1 bar2 bar3

This should also work well with hundreds of columns. It might be hard to avoid to define the translation once.

jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • Yea, thanks for this, of course you can have it cheaper. It also works easy with for loops but I'd like a `tidyverse` implementation. More so, your `if else` implementation works for this toy example but `qa` and `qb` are only two questionnaires here out of potentially hundreds in real data, so that would mean a lot of `else`'s, which is as bad as a lot of `across`'s in my view. – Claudiu Papasteri Feb 26 '23 at 14:18
  • @ClaudiuPapasteri Yeah, really bad, you're right about that, I added an update that uses a small dictionary. In tidyvert dialects I am not very interested, sorry. – jay.sf Feb 26 '23 at 16:04