4

Struggling to find an elegant solution to this...

I have responses to "please select all that apply" questions, where each of choices A to F per question are coded as binary variables. So, for example, the first responder in the fake dataset below ticked only D for question 1, then A,C,D and E for question 2.

library(dplyr)

cols <- paste0('foo', '_', c(1:2, '3a', '3b')) %>%
  lapply(\(i) paste0(i, '_', LETTERS[1:6])) %>%
  unlist()

set.seed(1)
df <- lapply(cols, \(i) i = sample(0:1, 5, replace = TRUE)) %>%
  setNames(cols) %>%
  data.frame()

'data.frame':   5 obs. of  24 variables:
 $ foo_1_A : int  0 1 0 0 1
 $ foo_1_B : int  0 0 0 1 1
 $ foo_1_C : int  0 0 0 0 0
 $ foo_1_D : int  1 1 1 1 0
 $ foo_1_E : int  0 0 0 0 0
 $ foo_1_F : int  0 1 0 0 1
 $ foo_2_A : int  1 1 0 1 0
 $ foo_2_B : int  0 1 0 1 1
 $ foo_2_C : int  1 1 0 1 1
 $ foo_2_D : int  1 1 1 0 0
 $ foo_2_E : int  1 0 1 1 0
 $ foo_2_F : int  0 1 1 1 0
 $ foo_3a_A: int  0 1 1 1 1
 $ foo_3a_B: int  1 1 0 1 1
 $ foo_3a_C: int  1 1 0 0 0
 $ foo_3a_D: int  1 1 0 0 1
 $ foo_3a_E: int  1 1 0 0 0
 $ foo_3a_F: int  1 0 1 0 1
 $ foo_3b_A: int  0 0 1 1 0
 $ foo_3b_B: int  0 0 1 1 0
 $ foo_3b_C: int  1 1 1 0 0
 $ foo_3b_D: int  0 0 1 0 0
 $ foo_3b_E: int  0 0 0 1 1
 $ foo_3b_F: int  1 1 0 1 1

What I want is to recode 1 to each column's choice-letter (A, B, C, D, E, or F) and to concatenate the choices for each question so that I have something like this:

foo_1  D     ADF   D     BD    ABF
foo_2  ACDE  ABCDF DEF   ABCEF BC
foo_3a BCDEF ABCDE AF    AB    ABDF
foo_3b CF    CF    ABCD  ABEF  EF

This is as far as I got before realising I'd get stuck repeating similar code over and over:

df <- df %>% mutate(across(
  starts_with('foo') & ends_with('A'),
  ~ recode(., `1` = 'A', .default = NA_character_)
))
jatx50
  • 65
  • 4
  • For the recoding step, I would use this: `df %>% mutate(across(everything(), ~ifelse(.x==1, str_extract(deparse(substitute(.x)), "[A-Z]$"), NA)))`. Not so sure how to go from there – GuedesBF Sep 14 '21 at 00:17

2 Answers2

2

One option is to reshape to long format with pivot_longer and then do a grouping by the sequence column generated earlier to summarise across the 'foo' columns by converting it to logical from binary, subset the 'grp' column and paste(str_c) them together

library(dplyr)
library(tidyr)
library(stringr)
df %>%
    mutate(rn = row_number()) %>%
    pivot_longer(cols = -rn, names_to = c(".value", "grp"), 
          names_pattern = "^(.*_.*)_(.*)") %>% 
    group_by(rn) %>%
    summarise(across(-grp, ~ str_c(grp[as.logical(.)], 
          collapse="")), .groups = 'drop') %>% 
    select(-rn)

-output

# A tibble: 5 x 4
  foo_1 foo_2 foo_3a foo_3b
  <chr> <chr> <chr>  <chr> 
1 D     ACDE  BCDEF  CF    
2 ADF   ABCDF ABCDE  CF    
3 D     DEF   AF     ABCD  
4 BD    ABCEF AB     ABEF  
5 ABF   BC    ABDF   EF 

Or another option is

library(purrr)
df %>% 
   summarise(across(everything(), ~case_when(as.logical(.) ~ 
        rep(str_remove(cur_column(), ".*_.*_"), n())))) %>% 
   split.default(str_remove(names(.), "_[^_]+$")) %>%
   map_dfc(~ .x %>%
        unite(new, everything(), na.rm = TRUE, sep="") %>% 
        pull(new)) 
# A tibble: 5 x 4
  foo_1 foo_2 foo_3a foo_3b
  <chr> <chr> <chr>  <chr> 
1 D     ACDE  BCDEF  CF    
2 ADF   ABCDF ABCDE  CF    
3 D     DEF   AF     ABCD  
4 BD    ABCEF AB     ABEF  
5 ABF   BC    ABDF   EF    

Or using base R

sapply(split.default(df, sub("(.*_.*)_.*", "\\1", names(df))), 
    function(x) apply(x, 1, FUN= function(y) paste(sub(".*_", "", 
     names(y))[as.logical(y)], collapse="")))
    foo_1 foo_2   foo_3a  foo_3b
[1,] "D"   "ACDE"  "BCDEF" "CF"  
[2,] "ADF" "ABCDF" "ABCDE" "CF"  
[3,] "D"   "DEF"   "AF"    "ABCD"
[4,] "BD"  "ABCEF" "AB"    "ABEF"
[5,] "ABF" "BC"    "ABDF"  "EF"  
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks! The first of your 3 solutions makes sense to me. Except, the only thing I don't quite understand is the `.groups = 'drop'` part of `across`—what is this doing? – jatx50 Sep 15 '21 at 09:07
  • @jatx50 It is related to the `summarise` dropping groups attribute. It is mentioned detailed in [here](https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override/62140681#62140681) – akrun Sep 15 '21 at 16:25
1

Another option is to use dplyr::rowwise together with dplyover::over (disclaimer: I'm the maintainer of {dplyover}). dplyover::cut_names allows us to select the string parts of the column names that we need. Then we can use this inside across to get those parts of df that we need to then get the names and subset them with the rowwise data as.logical. Finally we need to replace the names so that only the last letter remains.

library(tidyverse)
library(dplyover) # https://github.com/TimTeaFan/dplyover

df %>% rowwise %>% 
  summarise(over(cut_names("_\\w$"), ~ 
    unlist(across(starts_with(.x))) %>% 
    {names(.)[as.logical(.)]} %>%  
    {paste(gsub(paste0(.x, "_"), "", .), collapse = "")}
  )) 

#> # A tibble: 5 x 4
#>   foo_1 foo_2 foo_3a foo_3b
#>   <chr> <chr> <chr>  <chr> 
#> 1 D     ACDE  BCDEF  CF    
#> 2 ADF   ABCDF ABCDE  CF    
#> 3 D     DEF   AF     ABCD  
#> 4 BD    ABCEF AB     ABEF  
#> 5 ABF   BC    ABDF   EF

Created on 2021-09-14 by the reprex package (v2.0.1)

TimTeaFan
  • 17,549
  • 4
  • 18
  • 39
  • Thanks for this alternate solution. I already have `tidyverse` in my project so I've decided to use one of the above solutions. – jatx50 Sep 15 '21 at 09:08