Dealing with misclassified data using ifelse in R

Question

I'm not sure exactly how to frame this, but I'll illustrate it. Essentially I have a survey where I ask participants which diet drinks they drink. But some participants have mistakingly put water as a diet drink, throwing off the results. The way the data is framed participants first answer "type of diet drink" (diet_drinks), then "how many per day?" (diet_drinks_amt), then "how many days per week?" (diet_drinks_amt).

What I want to do is to recode data in the diet_drinks_amt and diet_drinks_days columns to read NA where a participant put "Water" for the diet_drinks question, without removing the data for other participants who answered correctly. I know about ifelse but what I don't know how to do is recode the mistakes without erasing the correct data.

dataset <- data.frame(
  diet_drinks = c("Diet Coke", "Diet Sprite", "Water"),
  diet_drinks_amt = c(2,7,3),
  diet_drinks_days = c(3,6,7)
  )

One way I've got around the issue is to create a new variable that codes "Water" as 0 and everything else as 1, and then multiplying this variable by the diet_drinks_amt and diet_drinks_days variables, but I'm sure there must be a simpler way. Thanks in advance :)

r2evans · Accepted Answer · 2020-08-01T21:12:08.427

ind <- tolower(dataset$diet_drinks) == "water"
ind
# [1] FALSE FALSE  TRUE
dataset[ind, -1] <- NA
dataset
#   diet_drinks diet_drinks_amt diet_drinks_days
# 1   Diet Coke               2                3
# 2 Diet Sprite               7                6
# 3       Water              NA               NA

(I made an assumption that lower-case "water" should also be a match, I wasn't certain if it's feasible. tolower is optional, of course.)

If you need to target specific columns, you can increase the specificity with numeric or name reference:

dataset[ind,2:3] <- NA
dataset[ind,c("diet_drinks_amt", "diet_drinks_days")] <- NA
dataset[ind,startsWith(names(dataset), "diet_drinks_")] <- NA

tidyverse

library(dplyr)
dataset %>%
  mutate_at(vars(starts_with("diet_drinks_")), ~ if_else(diet_drinks == "Water", NA_real_, .))
#   diet_drinks diet_drinks_amt diet_drinks_days
# 1   Diet Coke               2                3
# 2 Diet Sprite               7                6
# 3       Water              NA               NA

data.table

library(data.table)
as.data.table(dataset)[ diet_drinks == "Water", grep("^diet_drinks_", names(dataset), value = TRUE) := NA_real_, ][]
#    diet_drinks diet_drinks_amt diet_drinks_days
# 1:   Diet Coke               2                3
# 2: Diet Sprite               7                6
# 3:       Water              NA               NA

score 1 · Answer 2 · answered Aug 01 '20 at 21:08

You can try this:

#Data
dataset <- data.frame(
  diet_drinks = c("Diet Coke", "Diet Sprite", "Water"),
  diet_drinks_amt = c(2,7,3),
  diet_drinks_days = c(3,6,7)
)
#Index to track water
index <- which(dataset$diet_drinks=='Water')
#Replace
dataset[index,-1]<-NA

  diet_drinks diet_drinks_amt diet_drinks_days
1   Diet Coke               2                3
2 Diet Sprite               7                6
3       Water              NA               NA

Dealing with misclassified data using ifelse in R

2 Answers2

tidyverse

data.table