In R, I want to create two new variables if the variable I'm checking has missing values

Question

df <- data.frame(replicate(10, sample(0:100, 1000, replace = TRUE)))
eee <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
View(eee)

This gives me a data frame with missing data.

For each variable in my data frame that has missing values, I want to create two new variables. The first is a binary indicator: "yes" if the value was missing, "no" if it wasn't. The second should equal the original value where it is not missing, and the mean of the original variable where it is missing.

I'm not sure how to write code that does this across my whole data set rather than one variable at a time.

Thank you for the help!

can you provide a reproducible example https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Bulat, Sep 23 '19 at 19:44
Is `population.data$j` a single value or a vector of values? If you want to check if one NA is present in the column, please check: https://stackoverflow.com/questions/6551825/fastest-way-to-detect-if-vector-has-at-least-1-na In addition, — Chelmy88, Sep 23 '19 at 20:13
Now that I see your data, looks like you want to cover multiple columns. — markhogue, Sep 23 '19 at 20:37

markhogue · Answer 1 · 2019-09-24T02:08:44.860


I worked something out that is crude but effective.

df <- data.frame(replicate(10, sample(0:100, 1000, replace = TRUE)))

eee <- as.data.frame(lapply(df, 
  function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))



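# Part 1: "yes" where a value is missing, "no" otherwise
replace_fn1 <- function(x) ifelse(is.na(x), "yes", "no")
pt1 <- apply(eee, c(1, 2), replace_fn1)

# Part 2: original values, with each NA replaced by its column mean
col_means <- apply(eee, 2, mean, na.rm = TRUE)

# set up a matrix the same size as eee, each column filled with that column's mean
col_means <- matrix(col_means,
                    nrow = nrow(eee), ncol = ncol(eee), byrow = TRUE)

# start from the original values (not from pt1) and overwrite only the NA cells
pt2 <- as.matrix(eee)
na_ind <- which(is.na(eee), arr.ind = TRUE)
pt2[na_ind] <- col_means[na_ind]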
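
If the goal is to append the two new variables to the data frame itself, and only for columns that actually contain missing values, a loop over the column names is one option. The following is a minimal sketch and not part of the code above: `add_missing_vars` and the `_missing` / `_imputed` suffixes are hypothetical names chosen for illustration, and it assumes the columns to impute are numeric.

```r
# Hypothetical helper (illustrative name): for every numeric column that
# contains at least one NA, append a "<col>_missing" flag and a
# "<col>_imputed" copy in which NAs are replaced by the column mean.
add_missing_vars <- function(data) {
  out <- data
  for (nm in names(data)) {
    x <- data[[nm]]
    # skip non-numeric columns and columns without any missing value
    if (!is.numeric(x) || !anyNA(x)) next
    out[[paste0(nm, "_missing")]] <- ifelse(is.na(x), "yes", "no")
    out[[paste0(nm, "_imputed")]] <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
  }
  out
}

# Small deterministic example: columns a and c have NAs, b does not
toy <- data.frame(a = c(1, NA, 3), b = c(10, 20, 30), c = c(NA, 4, 8))
res <- add_missing_vars(toy)
```

With the simulated data from the question, `add_missing_vars(eee)` would be the equivalent call. Columns without missing values are left alone, so no new variables are created for them.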

edited Sep 24 '19 at 02:08

answered Sep 23 '19 at 20:25

markhogue


Awesome, Thank you! – Ben Rossmiller Sep 23 '19 at 21:12
Is there a straight-forward way to use similar logic to do the second part? If I want a variable that is the 'original' if it is NOT missing, and is the mean of that column if it is missing? – Ben Rossmiller Sep 23 '19 at 23:07
Did you see this latest? Does it get a useful rating? – markhogue Sep 24 '19 at 14:35
