How to replace NAs with row means if proportion of row-wise NAs is below a certain threshold?

Question

Apologies for the somewhat cumbersome question, but I am currently working on a mental health study. For one of the mental health screening tools there are 15 variables, each of which can have values of 0-3. The total score for each row/participant is then assigned by taking the sum of these 15 variables. The documentation for this tool states that if more than 20% of the values for a particular row/participant are missing, the total score should be taken as missing also, however if fewer than 20% of the values for a row are missing, each missing value should be assigned the mean of the remaining values for that row.

I decided that to do this I would have to calculate the proportion of NAs for each participant, calculate the mean of all 15 variables excluding NAs for each participant, and then use a conditional mutate statement (or something similar) that checked if the proportion of NAs was less than 20% and if so replaced NAs for the relevant columns with the mean value for that row, before finding the sum of all 15 variables for each row. The dataset also contains other columns besides these 15, so applying a function to all of the columns would not be useful.

To calculate the mean score without NAs I did the following:

mental$somatic_mean <- rowMeans(mental [, c("var1", "var2", "var3", 
"var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11", 
"var12","var13", "var14", "var15")], na.rm=TRUE)

And to calculate the proportion of NAs for each variable:

mental$somatic_na <- rowMeans(is.na(mental [, c("var1", "var2", 
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11", 
"var12", "var13", "var14", "var15")]))

However when I attempted the mutate() statement to alter the rows where fewer than 20% of values were NA I can't identify any code that works. I have tried a lot of permutations by this point, including the following for each variable:

mental_recode <- mental %>%
  rowwise() %>%
  mutate(var1 = if(somatic_na<0.2) 
  replace_na(list(var1= somatic_mean)))

Which returns:

"no applicable method for 'replace_na' applied to an object of class "list""

and attempting to do them all together without using mutate():

mental %>%
  rowwise() %>%
  if(somatic_na<0.2)
                     replace_na(list(var1 = somatic_mean,   var2= 
somatic_mean,   var3 = somatic_mean,   var4 = somatic_mean,  var5 = 
somatic_mean,  var6 = somatic_mean,  var7 = somatic_mean, var8 = 
somatic_mean,  var9 = somatic_mean,  var10 = somatic_mean,   var11 = 
somatic_mean,  var12 = somatic_mean,   var13 = somatic_mean,  var14 = 
somatic_mean,  var15 = somatic_mean ))

Which returns:

Error in if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean,  : 
  argument is not interpretable as logical
In addition: Warning message:
In if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean,  :
  the condition has length > 1 and only the first element will be used

I also tried using if_else() in conjunction with mutate() and setting the value to NA if the condition was not met, but could not get that to work after various permutations and error messages either.

EDIT: Dummy data can be generated by the following:

mental <- structure(list(id = 1:21, var1 = c(0L, 0L, 1L, 1L, 1L, 0L, 0L, 
                               NA, 0L, 0L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 0L, 0L), var2 = c(0L, 
 0L, 1L, 1L, 1L, 0L, 0L, 2L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 
2L, 0L, 1L, 1L), var3 = c(0L, 0L, 0L, 1L, 1L, 0L, 1L, 2L, 1L, 
1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 2L, 0L, 1L, 1L), var4 = c(1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 1L, 0L, 0L), var5 = c(0L, 0L, 0L, 1L, NA, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), var6 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), var7 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, NA, 0L), var8 = c(0L, 
0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), var9 = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), var10 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 
0L, 0L, NA, 0L), var11 = c(1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, NA, 0L), var12 = c(1L, 
0L, 1L, 1L, NA, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 1L, 1L), var13 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 
0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, NA, 0L), var14 = c(1L, 
0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 
2L, 0L, 1L, 0L), var15 = c(1L, 0L, 2L, NA, NA, 0L, NA, 0L, 0L, 
0L, 0L, 0L, NA, NA, 0L, NA, NA, NA, NA, NA, 0L)), .Names = c("id", 
"var1", "var2", "var3", "var4", "var5", "var6", "var7", "var8", 
"var9", "var10", "var11", "var12", "var13", "var14", "var15"), class =                                 
"data.frame", row.names = c(NA, 
-21L))

Does anyone know of code that would work for this sort of situation?

Thanks in advance!

Even a two-row version of your dataset would be helpful, particularly if one row had >20% NAs and the other <20% NAs. — Nick Nimchuk, Aug 10 '17 at 14:34
I made an error substituting the data names for dummy names as the real ones have somewhat complex names, which I've corrected now. I'll attempt to make a short dummy dataset too. — Mel, Aug 10 '17 at 14:36
You can follow [this link](http://stackoverflow.com/questions/5963269) — Sotos, Aug 10 '17 at 14:38
what if some rows have exactly 20% NAs (i.e. 3 of the 15 columns)? — bouncyball, Aug 10 '17 at 14:53
The guidance document doesn't actually specify what to do in that case. I was originally planning to leave the 20% cases as NAs along with the >20% cases, though if there are a lot of them and I lose too many cases I may have to reconsider that. (I've added the code for the dummy data now too.) — Mel, Aug 10 '17 at 15:08

Dave Gruenewald · Accepted Answer · 2017-08-11T14:29:49.910

Here is a way to do it all in one chain using dplyr using your supplied data frame.

First create a vector of all column names of interest:

name_col <- colnames(mental)[2:16]

And now use dplyr

library(dplyr)

mental %>% 
  # First create the column of row means
  mutate(somatic_mean = rowMeans(.[name_col], na.rm = TRUE)) %>% 
  # Now calculate the proportion of NAs
  mutate(somatic_na = rowMeans(is.na(.[name_col]))) %>% 
  # Create this column for filtering out later
  mutate(somatic_usable = ifelse(somatic_na < 0.2,
                                 "yes", "no")) %>% 
  # Make the following replacement on a row basis 
  rowwise() %>%
  mutate_at(vars(name_col), # Designate eligible columns to check for NAs
            funs(replace(., 
                         is.na(.) & somatic_na < 0.2, # Both conditions need to be met
                         somatic_mean))) %>% # What we are subbing the NAs with
  ungroup() # Now ungroup the 'rowwise' in case you need to modify further

Now, if you wanted to only select the entries that have less than 20% NAs, you can pipe the above into the following:

filter(somatic_usable == "yes")

Also of note, if you wanted to instead make the condition less than or equal to 20%, you would need to replace the two somatic_na < 0.2 with somatic_na <= 0.2.

Hope this helps!

Thanks! I've tried this but I'm now getting: "Error: All select() inputs must resolve to integer column positions. The following do not: * name_col" I've tried putting vars_() or mutate_at_() instead but those don't appear to exist in the way select_() does. Any ideas as to what I've done wrong? — Mel, Aug 11 '17 at 08:39
Hmm, it might be a `dplyr` version issue. I'm using `dplyr 0.7.1`. Alternatively, you might be including a column name in the `name_col` vector that is not in the `mental` data frame you provided. — Dave Gruenewald, Aug 11 '17 at 14:30
If I just list them out it works fine, I think it might be because the version of R we're using at work is not current. Thanks! — Mel, Aug 18 '17 at 13:31

bouncyball · Answer 2 · 2017-08-10T15:10:35.397

Here's a way using just base R expressions and remember mathematical properties of sums and means:

# generate fake data
set.seed(123)

dat <- data.frame(
  ID = 1:10,
  matrix(sample(c(0:3, NA), 10 * 15, TRUE), nrow = 10, ncol = 15),
  'another_var' = 'foo',
  'second_var' = 'bar',
  stringsAsFactors = FALSE
)

var_names <- paste0('X', 1:15)

# add number of NAs to data

dat$na_num <- rowSums(is.na(dat[var_names]))

# add row sum

dat$row_sum <- rowSums(dat[var_names], na.rm = TRUE) 

# add row mean

dat$row_mean <- rowMeans(dat[var_names], na.rm = TRUE)

# add final sum

dat$final_sum <- dat$row_sum + dat$row_mean * dat$na_num

# recode final sum to be NA if prop > .2

dat$final_sum <- ifelse(rowMeans(is.na(dat[var_names])) > .2,
                        NA,
                        dat$final_sum)

Here's a function that does the same thing. Where you specify your data and then a character vector of your variable names.

total_sum_calculation <- function(data, var_names){
  # add number of NAs to data
  na_num <- rowSums(is.na(data[var_names]))

  # add row sum
  row_sum <- rowSums(data[var_names], na.rm = TRUE) 

  # add row mean
  row_mean <- rowMeans(data[var_names], na.rm = TRUE)

  # add final sum
  final_sum <- row_sum + row_mean * na_num

  # recode final sum to be NA if prop > .2
  ifelse(rowMeans(is.na(data[var_names])) > .2,
                          NA,
                          final_sum)

}

v_names <- paste0('var', 1:15)
total_sum_calculation(data = mental, var_names = v_names)

 [1] 6.000000 0.000000 8.000000 7.500000       NA 0.000000 3.214286 9.230769 6.000000 2.000000 1.000000 0.000000 4.285714
[14]       NA 5.357143 5.357143 5.357143 9.642857 1.071429       NA 3.000000

How to replace NAs with row means if proportion of row-wise NAs is below a certain threshold?

2 Answers2

Linked