Mutating based on commonalities in names of columns

Question

Packages I suspect are needed (I was planning to use them but can't get them working):

#Load packages
if(!("pacman" %in% .packages(all.available = T))){
    install.packages("pacman")
    library("pacman")
}else if(!("pacman" %in% (.packages()))){
    library("pacman")
}
p_load(magrittr, plyr, dplyr,
       rlang, tibble, tidyr,
       purrr)

Generate some data for this example:

#For reproducibility
set.seed(1)
tib <- tibble(
    ID = letters,
    A_1 = runif(26),
    A_2 = runif(26),
    B_1 = runif(26),
    B_2 = runif(26),
    B_3 = runif(26),
    C_1 = runif(26),
    C_2 = runif(26),
    C_3 = runif(26),
    C_4 = runif(26)
)
#Remove some data points: about 25% of the rows in each of columns 2 to 9 (C_4 is left complete)
for(i in 2:9){
    pick_rows <- sample(1:nrow(tib[i]), nrow(tib[i])*.25)
    tib[pick_rows, i] <- NA
}

Then the idea of what I want to do is as follows:

For each category (add one new column for each category) and row (ID), check and flag the following:

(a) are all values NA? Flag as 'MNAR'

(b) are some but not all values missing? Flag as 'MAR/MCAR'

(c) are there no missing values? Flag as 'Not missing'

To me, it seems that this part should be computationally cheap, but in my current approach, this is a major bottleneck in my code.

This is my current approach:

for (i in tib %>%
     #Only numeric columns contain relevant data
     keep(is.numeric) %>%
     #Get unique identifiers
     colnames() %>% gsub('[0-9]$', '', .) %>% unique()
) {
    #Generate a new column
    tib[[paste0(i, 'missing')]] <- tib %>%
        #Select the conditions columns
        select(contains(i)) %>%
        #For each row
        apply(1, function(x) x %>%
                  #Check if
        {case_when(
            #no values, (the most common event)
            all(!is.na(.)) ~ 'Not missing',
            #all values, (the least common event)
            all(is.na(.)) ~ 'MNAR',
            #or any values (the second most common event)
            any(is.na(.)) ~ 'MAR/MCAR'
            #are missing
        )}
        )
}

The approach I'm trying to develop, because I think it will be faster, is:

categories <- tib %>%
    keep(is.numeric) %>%
    colnames() %>%
    gsub('[0-9]$', '', .) %>%
    unique()
tib %>%
    mutate_at(
        vars(syms(grep(paste0(categories, collapse = '|'),
                       colnames(tib),
                       value = T))),
        funs(missing = case_when(
            #no values
            all(!is.na(.)) ~ 'Not missing',
            #or all values
            all(is.na(.)) ~ 'MNAR',
            #any values
            any(is.na(.)) ~ 'MAR/MCAR'
            #are missing
                                         )
                                )
            )

This obviously doesn't work, but I think it is decent pseudocode for what I'm trying to do. Part of it needs to call map from purrr, but I can't even get mutate to identify the correct group of columns at this point (I have been working with more primitive code for that).

Searching in StackOverflow I found the following threads:

dplyr - mutate formula based on similarities in column names

Conditionally mutate columns based on column class

dplyr mutate multiple columns based on names in vectors

Mutate multiple columns in a dataframe

none of which seems relevant to my question.

EDIT:

Desired output:

> tib
# A tibble: 26 x 13
   ID       A_1     A_2     B_1    B_2    B_3     C_1    C_2    C_3   C_4 A_missing  B_missing  C_missing 
   <chr>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl> <dbl> <chr>      <chr>      <chr>     
 1 a      0.266  0.0134  0.438   0.777  0.633  0.575   0.530 NA     0.256 Not missi~ Not missi~ MAR/MCAR  
 2 b      0.372  0.382   0.245   0.961  0.213 NA      NA      0.503 0.718 Not missi~ Not missi~ MAR/MCAR  
 3 c      0.573  0.870   0.0707 NA      0.129  0.0355 NA      0.877 0.961 Not missi~ MAR/MCAR   MAR/MCAR  
 4 d      0.908 NA      NA       0.713  0.478 NA      NA      0.189 0.100 MAR/MCAR   MAR/MCAR   MAR/MCAR  
 5 e      0.202 NA       0.316   0.400  0.924 NA      NA     NA     0.763 MAR/MCAR   Not missi~ MAR/MCAR  
 6 f      0.898  0.600   0.519  NA      0.599  0.598   0.895  0.724 0.948 Not missi~ MAR/MCAR   Not missi~
 7 g      0.945  0.494   0.662   0.757 NA      0.561  NA     NA     0.819 Not missi~ MAR/MCAR   MAR/MCAR  
 8 h      0.661 NA       0.407   0.203 NA      0.526   0.780  0.548 0.308 MAR/MCAR   MAR/MCAR   Not missi~
 9 i      0.629  0.827   0.913   0.711  0.357  0.985   0.881  0.712 0.650 Not missi~ Not missi~ Not missi~
10 j     NA     NA       0.294   0.122 NA      0.508  NA      0.389 0.953 MNAR       MAR/MCAR   MAR/MCAR  
# ... with 16 more rows

akrun · Accepted Answer · 2018-11-18T23:58:43.713

One option is to split the numeric columns by category and then apply the function to each group with map/pmap:

library(tidyverse)
f1 <- function(x) case_when(all(!is.na(x)) ~ "Not missing",
               all(is.na(x)) ~ "MNAR", 
               any(is.na(x)) ~ "MAR/MCAR")
tib %>% 
    keep(is.numeric) %>%
    split.default(str_remove(names(.), '_\\d+')) %>%
    map_df(~ .x %>% 
                pmap_chr(~ f1(c(...)))) %>%
    rename_all(~ paste0(., '_missing')) %>% 
    bind_cols(tib, .)
# A tibble: 26 x 13
#   ID       A_1     A_2     B_1    B_2    B_3     C_1    C_2    C_3   C_4 A_missing   B_missing   C_missing  
#   <chr>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl> <dbl> <chr>       <chr>       <chr>      
# 1 a      0.266  0.0134  0.438   0.777  0.633  0.575   0.530 NA     0.256 Not missing Not missing MAR/MCAR   
# 2 b      0.372  0.382   0.245   0.961  0.213 NA      NA      0.503 0.718 Not missing Not missing MAR/MCAR   
# 3 c      0.573  0.870   0.0707 NA      0.129  0.0355 NA      0.877 0.961 Not missing MAR/MCAR    MAR/MCAR   
# 4 d      0.908 NA      NA       0.713  0.478 NA      NA      0.189 0.100 MAR/MCAR    MAR/MCAR    MAR/MCAR   
# 5 e      0.202 NA       0.316   0.400  0.924 NA      NA     NA     0.763 MAR/MCAR    Not missing MAR/MCAR   
# 6 f      0.898  0.600   0.519  NA      0.599  0.598   0.895  0.724 0.948 Not missing MAR/MCAR    Not missing
# 7 g      0.945  0.494   0.662   0.757 NA      0.561  NA     NA     0.819 Not missing MAR/MCAR    MAR/MCAR   
# 8 h      0.661 NA       0.407   0.203 NA      0.526   0.780  0.548 0.308 MAR/MCAR    MAR/MCAR    Not missing
# 9 i      0.629  0.827   0.913   0.711  0.357  0.985   0.881  0.712 0.650 Not missing Not missing Not missing
#10 j     NA     NA       0.294   0.122 NA      0.508  NA      0.389 0.953 MNAR        MAR/MCAR    MAR/MCAR   
# ... with 16 more rows
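
To make the grouping step easier to follow, here is a minimal base R sketch of what `split.default` does with a data frame. The toy data frame `df` and the names `groups` and `pieces` are made up for illustration and are not part of the answer; the column names only mimic the question's `<category>_<index>` pattern.

```r
# Toy data frame (hypothetical); column names follow the question's pattern
df <- data.frame(A_1 = c(1, NA), A_2 = c(2, NA),
                 B_1 = c(NA, 3), B_2 = c(4, 5))

# Strip the trailing "_<digits>" to get one group label per column
groups <- sub("_[0-9]+$", "", names(df))

# split.default() treats the data frame as a list of columns, so each
# element of the result is a data frame holding one category's columns
pieces <- split.default(df, groups)

names(pieces)    # "A" "B"
names(pieces$A)  # "A_1" "A_2"
```

Each element of `pieces` can then be handed to a row-wise function, which is the role `pmap_chr` plays in the pipeline above.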

Another option is to gather the data into 'long' format, apply the function f1 to create the new column, and then spread it back:

tib %>%
  gather(key, val, -ID) %>%
  separate(key, into = c('key1', 'key2')) %>% 
  group_by(ID, key1) %>%
  mutate(missing = f1(val)) %>% 
  select(-val, -key2) %>%
  distinct() %>%
  spread(key1, missing) %>% 
  rename_at(vars(A:C), ~ paste0(., '_missing')) %>% 
  left_join(tib, .)
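
Since the three flags depend only on how many values are missing in each row, the same classification can probably also be done without a row-wise loop by counting NAs per row. The following is only a sketch under that assumption, not something from the answer above: the helper `flag_missing` and the toy data frame `df` are hypothetical, and it uses base R only.

```r
# Hypothetical helper: classify every row of one category's columns
# by counting its NAs instead of applying a function row by row
flag_missing <- function(cols) {
  n_na <- rowSums(is.na(cols))  # number of missing values per row
  unname(ifelse(n_na == 0, "Not missing",
                ifelse(n_na == ncol(cols), "MNAR", "MAR/MCAR")))
}

# Toy data (hypothetical), same naming pattern as in the question
df <- data.frame(ID  = c("a", "b", "c"),
                 A_1 = c(1, NA, NA), A_2 = c(2, 3, NA),
                 B_1 = c(NA, 1, 2),  B_2 = c(NA, 2, 3),
                 stringsAsFactors = FALSE)

# Group the numeric columns by category prefix and flag each group
num_cols <- df[sapply(df, is.numeric)]
groups   <- sub("_[0-9]+$", "", names(num_cols))
flags    <- lapply(split.default(num_cols, groups), flag_missing)
names(flags) <- paste0(names(flags), "_missing")

result <- cbind(df, as.data.frame(flags, stringsAsFactors = FALSE))
result$A_missing  # "Not missing" "MAR/MCAR" "MNAR"
result$B_missing  # "MNAR" "Not missing" "Not missing"
```

Replacing `df` with `tib` should give the same `*_missing` columns as the approaches above, though that has not been benchmarked here.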

@Baraliuh That is okay for `gather/spread` as we will be grouping by the columns 'A', 'B', 'C' after the `separate` step — akrun, Nov 18 '18 at 22:58
Changing `tib[-1]` to `tib %>% keep(is.numeric)` will generalize it a bit. Is it possible to get the '.*_missing' flag instead of just '.*' (where .* is the name of the category)? As a side note, microbenchmark between your method and mine shows that yours is about twice as fast! Thank you very much. — Baraliuh, Nov 18 '18 at 23:44
@Baraliuh Regarding your second comment, I have used `rename_all`. Have you checked that? — akrun, Nov 18 '18 at 23:47
I have used rename, but never used rename_all (or at least never found the opportunity to use it)! A nice addition to my R vocabulary; thank you for introducing it to me! :) I generally like your approach using split; I didn't think of that one, another thing to look for! split+map, nice combo! — Baraliuh, Nov 18 '18 at 23:50
As a side note, in terms of run time the functions rank as: your first function << my function < your last function. The benchmark looked horrid in a comment, so I didn't include it. — Baraliuh, Nov 19 '18 at 00:06
@Baraliuh It would be less efficient as we are doing some conversions to 'long' and then back to wide, but I just wanted to mention this method (probably good for small datasets) — akrun, Nov 19 '18 at 00:48
