In a dataframe, I want to turn some values to NA
. Which values should become NA
is column-dependent. This column-based value specification is given in a separate list object. I want to write a function that will accept:
- a dataframe to be cleaned up
- a vector specifying which columns to clean
- a list specifying each values are ok for each column
And will return a clean dataframe, where undesired values became NA
. While such a task can be achieved with a for
loop, I'm trying to figure out whether there's a simpler iterative way to do it. I typically like tidyverse
solutions, but will be happy with any idea.
Example Data
In the following dataset, each column has its own set of valid values that should remain, and the rest should become NA
.
library(tibble)
set.seed(2020)
## generate random strings: https://stackoverflow.com/a/42734863/6105259
sample_strings <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(letters, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(letters, n, TRUE))
}
df <-
tibble(id = 1:40,
color = sample(c(1:5), size = 40, replace = TRUE),
political = sample(c(sample(c("republican", "democrat", "green_party", "libertarian"), size = 20, replace = TRUE),
sample_strings(20))),
religion_status = sample(c(sample(c("secular", "traditional", "religious", "atheist", "agnostic"), size = 20, replace = TRUE),
sample_strings(20)))
)
## # A tibble: 40 x 4
## id color political religion_status
## <int> <int> <chr> <chr>
## 1 1 4 republican fzwue3975k
## 2 2 4 republican mgxoe2220e
## 3 3 1 democrat secular
## 4 4 1 republican secular
## 5 5 4 aibcg6459y oqnfv1461t
## 6 6 2 aedqi0739y ufhua9648n
## 7 7 1 zgvox0771x agnostic
## 8 8 5 democrat traditional
## 9 9 2 republican atheist
## 10 10 2 oxgge5657l nktsl2136o
## # ... with 30 more rows
Knowing which values should be kept in which column comes from the following list:
var_mapping_list <- list(preferences = list(age = list(originType = "NumberQuestionPage",
originIndex = 6L, title = "what is your age?", valueDescriptions = NULL),
political = list(originType = "QuestionPage", originIndex = 7L,
title = "what is your political affiliation?", valueDescriptions = list(
republican = "I have voted most of my life to the republican party",
democrat = "I have voted most of my life to the democratic party",
other = "other")), religion_status = list(originType = "QuestionPage",
originIndex = 9L, title = "how do you define your religiousness level? ",
valueDescriptions = list(secular = "I don't practice any religion although I do belong to one",
traditional = "I'm observant and keep some of the practices",
religious = "I practice a religion", other = "other")),
color = list(originType = "QuestionPage", title = "which color do you like the best",
valueDescriptions = list(`1` = "red", `2` = "blue", `3` = "yellow",
`4` = "pink", `5` = "orange")), pet = list(originType = "QuestionPage",
originIndex = 0L, title = "do you have a pet? ", valueDescriptions = list(
yes = "yes", no = "no"))))
One variable for example
Say that I want to clean up df$political
. To know which values to keep, I will first go to:
var_mapping_list$preferences$political$valueDescriptions
## $republican
## [1] "I have voted most of my life to the republican party"
## $democrat
## [1] "I have voted most of my life to the democratic party"
## $other
## [1] "other"
My rule is that all options except for other
are the valid values for the corresponding column in df
.
So this means that in df$political
, only republican
and democrat
should remain, and the rest should turn NA
.
Thus, an example workflow for just df$political
would be:
library(tidyr)
library(rlang)
library(dplyr)
vec_political_values_to_keep <-
var_mapping_list$preferences$political$valueDescriptions %>%
bind_rows %>%
pivot_longer(cols = tidyselect::everything(),
names_to = "option_key",
values_to = "description") %>%
filter(option_key != "other") %>%
pull(option_key)
df %>%
mutate(political = recode(political, !!!rlang::set_names(vec_political_values_to_keep), .default = NA_character_)) ## https://stackoverflow.com/a/63916563/6105259
## # A tibble: 40 x 4
## id color political religion_status
## <int> <int> <chr> <chr>
## 1 1 4 republican fzwue3975k
## 2 2 4 republican mgxoe2220e
## 3 3 1 democrat secular
## 4 4 1 republican secular
## 5 5 4 NA oqnfv1461t
## 6 6 2 NA ufhua9648n
## 7 7 1 NA agnostic
## 8 8 5 democrat traditional
## 9 9 2 republican atheist
## 10 10 2 NA nktsl2136o
I want to scale the above to any variable of interest in df
.
Desired Output
Specifying the vector
colnames_to_clean <- c("color", "political", "religion_status")
[1] "color" "political" "religion_status"
Should return the following dataframe:
## id color political religion_status
## <int> <int> <chr> <chr>
## 1 1 4 republican NA
## 2 2 4 republican NA
## 3 3 1 democrat secular
## 4 4 1 republican secular
## 5 5 4 NA NA
## 6 6 2 NA NA
## 7 7 1 NA NA
## 8 8 5 democrat traditional
## 9 9 2 republican NA
## 10 10 2 NA NA
I'll be grateful for any help with this!