recoding and collapasing multiple binary-coded columns

Question

Struggling to find an elegant solution to this...

I have responses to "please select all that apply" questions, where each of choices A to F per question are coded as binary variables. So, for example, the first responder in the fake dataset below ticked only D for question 1, then A,C,D and E for question 2.

library(dplyr)

cols <- paste0('foo', '_', c(1:2, '3a', '3b')) %>%
  lapply(\(i) paste0(i, '_', LETTERS[1:6])) %>%
  unlist()

set.seed(1)
df <- lapply(cols, \(i) i = sample(0:1, 5, replace = TRUE)) %>%
  setNames(cols) %>%
  data.frame()

'data.frame':   5 obs. of  24 variables:
 $ foo_1_A : int  0 1 0 0 1
 $ foo_1_B : int  0 0 0 1 1
 $ foo_1_C : int  0 0 0 0 0
 $ foo_1_D : int  1 1 1 1 0
 $ foo_1_E : int  0 0 0 0 0
 $ foo_1_F : int  0 1 0 0 1
 $ foo_2_A : int  1 1 0 1 0
 $ foo_2_B : int  0 1 0 1 1
 $ foo_2_C : int  1 1 0 1 1
 $ foo_2_D : int  1 1 1 0 0
 $ foo_2_E : int  1 0 1 1 0
 $ foo_2_F : int  0 1 1 1 0
 $ foo_3a_A: int  0 1 1 1 1
 $ foo_3a_B: int  1 1 0 1 1
 $ foo_3a_C: int  1 1 0 0 0
 $ foo_3a_D: int  1 1 0 0 1
 $ foo_3a_E: int  1 1 0 0 0
 $ foo_3a_F: int  1 0 1 0 1
 $ foo_3b_A: int  0 0 1 1 0
 $ foo_3b_B: int  0 0 1 1 0
 $ foo_3b_C: int  1 1 1 0 0
 $ foo_3b_D: int  0 0 1 0 0
 $ foo_3b_E: int  0 0 0 1 1
 $ foo_3b_F: int  1 1 0 1 1

What I want is to recode 1 to each column's choice-letter (A, B, C, D, E, or F) and to concatenate the choices for each question so that I have something like this:

foo_1  D     ADF   D     BD    ABF
foo_2  ACDE  ABCDF DEF   ABCEF BC
foo_3a BCDEF ABCDE AF    AB    ABDF
foo_3b CF    CF    ABCD  ABEF  EF

This is as far as I got before realising I'd get stuck repeating similar code over and over:

df <- df %>% mutate(across(
  starts_with('foo') & ends_with('A'),
  ~ recode(., `1` = 'A', .default = NA_character_)
))

For the recoding step, I would use this: `df %>% mutate(across(everything(), ~ifelse(.x==1, str_extract(deparse(substitute(.x)), "[A-Z]$"), NA)))`. Not so sure how to go from there — GuedesBF, Sep 14 '21 at 00:17

akrun · Accepted Answer · 2021-09-14T00:36:41.107

One option is to reshape to long format with pivot_longer and then do a grouping by the sequence column generated earlier to summarise across the 'foo' columns by converting it to logical from binary, subset the 'grp' column and paste(str_c) them together

library(dplyr)
library(tidyr)
library(stringr)
df %>%
    mutate(rn = row_number()) %>%
    pivot_longer(cols = -rn, names_to = c(".value", "grp"), 
          names_pattern = "^(.*_.*)_(.*)") %>% 
    group_by(rn) %>%
    summarise(across(-grp, ~ str_c(grp[as.logical(.)], 
          collapse="")), .groups = 'drop') %>% 
    select(-rn)

-output

# A tibble: 5 x 4
  foo_1 foo_2 foo_3a foo_3b
  <chr> <chr> <chr>  <chr> 
1 D     ACDE  BCDEF  CF    
2 ADF   ABCDF ABCDE  CF    
3 D     DEF   AF     ABCD  
4 BD    ABCEF AB     ABEF  
5 ABF   BC    ABDF   EF

Or another option is

library(purrr)
df %>% 
   summarise(across(everything(), ~case_when(as.logical(.) ~ 
        rep(str_remove(cur_column(), ".*_.*_"), n())))) %>% 
   split.default(str_remove(names(.), "_[^_]+$")) %>%
   map_dfc(~ .x %>%
        unite(new, everything(), na.rm = TRUE, sep="") %>% 
        pull(new)) 
# A tibble: 5 x 4
  foo_1 foo_2 foo_3a foo_3b
  <chr> <chr> <chr>  <chr> 
1 D     ACDE  BCDEF  CF    
2 ADF   ABCDF ABCDE  CF    
3 D     DEF   AF     ABCD  
4 BD    ABCEF AB     ABEF  
5 ABF   BC    ABDF   EF

Or using base R

sapply(split.default(df, sub("(.*_.*)_.*", "\\1", names(df))), 
    function(x) apply(x, 1, FUN= function(y) paste(sub(".*_", "", 
     names(y))[as.logical(y)], collapse="")))
    foo_1 foo_2   foo_3a  foo_3b
[1,] "D"   "ACDE"  "BCDEF" "CF"  
[2,] "ADF" "ABCDF" "ABCDE" "CF"  
[3,] "D"   "DEF"   "AF"    "ABCD"
[4,] "BD"  "ABCEF" "AB"    "ABEF"
[5,] "ABF" "BC"    "ABDF"  "EF"

Thanks! The first of your 3 solutions makes sense to me. Except, the only thing I don't quite understand is the `.groups = 'drop'` part of `across`—what is this doing? — jatx50, Sep 15 '21 at 09:07
@jatx50 It is related to the `summarise` dropping groups attribute. It is mentioned detailed in [here](https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override/62140681#62140681) — akrun, Sep 15 '21 at 16:25

TimTeaFan · Answer 2 · 2021-09-15T09:12:25.800

Another option is to use dplyr::rowwise together with dplyover::over (disclaimer: I'm the maintainer of {dplyover}). dplyover::cut_names allows us to select the string parts of the column names that we need. Then we can use this inside across to get those parts of df that we need to then get the names and subset them with the rowwise data as.logical. Finally we need to replace the names so that only the last letter remains.

library(tidyverse)
library(dplyover) # https://github.com/TimTeaFan/dplyover

df %>% rowwise %>% 
  summarise(over(cut_names("_\\w$"), ~ 
    unlist(across(starts_with(.x))) %>% 
    {names(.)[as.logical(.)]} %>%  
    {paste(gsub(paste0(.x, "_"), "", .), collapse = "")}
  )) 

#> # A tibble: 5 x 4
#>   foo_1 foo_2 foo_3a foo_3b
#>   <chr> <chr> <chr>  <chr> 
#> 1 D     ACDE  BCDEF  CF    
#> 2 ADF   ABCDF ABCDE  CF    
#> 3 D     DEF   AF     ABCD  
#> 4 BD    ABCEF AB     ABEF  
#> 5 ABF   BC    ABDF   EF

^{Created on 2021-09-14 by the reprex package (v2.0.1)}

Thanks for this alternate solution. I already have `tidyverse` in my project so I've decided to use one of the above solutions. — jatx50, Sep 15 '21 at 09:08

recoding and collapasing multiple binary-coded columns

2 Answers2