0

In a dataframe, I want to turn some values to NA. Which values should become NA is column-dependent. This column-based value specification is given in a separate list object. I want to write a function that will accept:

  1. a dataframe to be cleaned up
  2. a vector specifying which columns to clean
  3. a list specifying each values are ok for each column

And will return a clean dataframe, where undesired values became NA. While such a task can be achieved with a for loop, I'm trying to figure out whether there's a simpler iterative way to do it. I typically like tidyverse solutions, but will be happy with any idea.

Example Data

In the following dataset, each column has its own set of valid values that should remain, and the rest should become NA.

library(tibble)

set.seed(2020)

## generate random strings: https://stackoverflow.com/a/42734863/6105259
sample_strings <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(letters, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(letters, n, TRUE))
}

df <-
  tibble(id = 1:40,
         color = sample(c(1:5), size = 40, replace = TRUE),
         political = sample(c(sample(c("republican", "democrat", "green_party", "libertarian"), size = 20, replace = TRUE),
                              sample_strings(20))),
         religion_status = sample(c(sample(c("secular", "traditional", "religious", "atheist", "agnostic"), size = 20, replace = TRUE), 
                                    sample_strings(20)))
         )

## # A tibble: 40 x 4
##       id color political  religion_status
##    <int> <int> <chr>      <chr>          
##  1     1     4 republican fzwue3975k     
##  2     2     4 republican mgxoe2220e     
##  3     3     1 democrat   secular        
##  4     4     1 republican secular        
##  5     5     4 aibcg6459y oqnfv1461t     
##  6     6     2 aedqi0739y ufhua9648n     
##  7     7     1 zgvox0771x agnostic       
##  8     8     5 democrat   traditional    
##  9     9     2 republican atheist        
## 10    10     2 oxgge5657l nktsl2136o     
## # ... with 30 more rows

Knowing which values should be kept in which column comes from the following list:

var_mapping_list <- list(preferences = list(age = list(originType = "NumberQuestionPage", 
    originIndex = 6L, title = "what is your age?", valueDescriptions = NULL), 
    political = list(originType = "QuestionPage", originIndex = 7L, 
        title = "what is your political affiliation?", valueDescriptions = list(
            republican = "I have voted most of my life to the republican party", 
            democrat = "I have voted most of my life to the democratic party", 
            other = "other")), religion_status = list(originType = "QuestionPage", 
        originIndex = 9L, title = "how do you define your religiousness level? ", 
        valueDescriptions = list(secular = "I don't practice any religion although I do belong to one", 
            traditional = "I'm observant and keep some of the practices", 
            religious = "I practice a religion", other = "other")), 
    color = list(originType = "QuestionPage", title = "which color do you like the best", 
        valueDescriptions = list(`1` = "red", `2` = "blue", `3` = "yellow", 
            `4` = "pink", `5` = "orange")), pet = list(originType = "QuestionPage", 
        originIndex = 0L, title = "do you have a pet? ", valueDescriptions = list(
            yes = "yes", no = "no"))))

One variable for example

Say that I want to clean up df$political. To know which values to keep, I will first go to:

var_mapping_list$preferences$political$valueDescriptions

## $republican
## [1] "I have voted most of my life to the republican party"

## $democrat
## [1] "I have voted most of my life to the democratic party"

## $other
## [1] "other"

My rule is that all options except for other are the valid values for the corresponding column in df.

So this means that in df$political, only republican and democrat should remain, and the rest should turn NA.

Thus, an example workflow for just df$political would be:

library(tidyr)
library(rlang)
library(dplyr)

vec_political_values_to_keep <-
  var_mapping_list$preferences$political$valueDescriptions %>%
  bind_rows %>%
  pivot_longer(cols = tidyselect::everything(), 
               names_to = "option_key", 
               values_to = "description") %>%
  filter(option_key != "other") %>%
  pull(option_key)
 

df %>% 
  mutate(political = recode(political, !!!rlang::set_names(vec_political_values_to_keep), .default = NA_character_)) ## https://stackoverflow.com/a/63916563/6105259


## # A tibble: 40 x 4
##       id color political  religion_status
##    <int> <int> <chr>      <chr>          
##  1     1     4 republican fzwue3975k     
##  2     2     4 republican mgxoe2220e     
##  3     3     1 democrat   secular        
##  4     4     1 republican secular        
##  5     5     4 NA         oqnfv1461t     
##  6     6     2 NA         ufhua9648n     
##  7     7     1 NA         agnostic       
##  8     8     5 democrat   traditional    
##  9     9     2 republican atheist        
## 10    10     2 NA         nktsl2136o  

I want to scale the above to any variable of interest in df.

Desired Output

Specifying the vector

colnames_to_clean <- c("color", "political", "religion_status")

[1] "color"           "political"       "religion_status"

Should return the following dataframe:

##       id color political  religion_status
##    <int> <int> <chr>      <chr>          
##  1     1     4 republican NA             
##  2     2     4 republican NA             
##  3     3     1 democrat   secular        
##  4     4     1 republican secular        
##  5     5     4 NA         NA             
##  6     6     2 NA         NA             
##  7     7     1 NA         NA             
##  8     8     5 democrat   traditional    
##  9     9     2 republican NA             
## 10    10     2 NA         NA  

I'll be grateful for any help with this!

Emman
  • 3,695
  • 2
  • 20
  • 44
  • To be honest, your question is too long and includes extraneous information. Your real question is how to parse the survey data in a semi-automated way to recode the answers. But it's not really clear what you really want to extract from `var_mapping_list`, because your example for it doesn't even have green_party, but elsewhere in your code you do. Try to focus your question more. – Michael Dewar Dec 16 '20 at 15:52
  • Thank you for the comment. I've tried revising the question several times to make it clearer, but probably not enough. `green_party` *does* appear in the data, but *because* it's not in `var_mapping_list`, we know that `green_party` doesn't belong to the data and should therefore become `NA`. The entire purpose of `var_mapping_list` is to let us know which values should remain in `df`. – Emman Dec 16 '20 at 16:58
  • I've edited once again, I hope it's clearer. I'd appreciate any feedback if it requires further clarifications. – Emman Dec 16 '20 at 17:24

1 Answers1

1

Here is one possibility. First you put the valid values in a tibble.

new_list <- tibble(
  name  = names(var_mapping_list$preferences),
  x = var_mapping_list$preferences
) %>%
  mutate(all_vals = map2(x, name, ~ names(.x$valueDescriptions))) %>%
  select(-x)

This has the advantage that you now can easily work with valid values inside the tidyverse. Second, you join the valid values and check if the current value is a valid value:

df %>%
  gather(name, val, -id) %>%
  left_join(new_list, by = "name") %>% 
  group_by(name) %>%
  mutate(val = map2_chr(val, all_vals, ~if_else(.x %in% setdiff(.y, "other"), .x, NA_character_))) %>%
  select(-all_vals) %>%
  spread(name, val)

# A tibble: 40 x 4
      id color political  religion_status
   <int> <chr> <chr>      <chr>          
 1     1 4     republican NA             
 2     2 4     republican NA             
 3     3 1     democrat   secular        
 4     4 1     republican secular        
 5     5 4     NA         NA             
 6     6 2     NA         NA             
 7     7 1     NA         NA             
 8     8 5     democrat   traditional    
 9     9 2     republican NA             
10    10 2     NA         NA             
# ... with 30 more rows
Cettt
  • 11,460
  • 7
  • 35
  • 58