I'm trying to replace NA values in several columns with the mean of those columns. The mean is supposed to be calculated by row.

I've tried the code below on this data, but the NAs don't get replaced:

ID Price1 Price2 Price3 Price4
1  2.1    3      4      NA
2  2      3      4.5    NA
3  2      NA     4      NA
4  NA     3      4      NA

price_cols <- c("Price1", "Price2", "Price3", "Price4")
data %>%
  mutate_at(price_cols, funs(if_else(is.na(.), mean(price_cols, na.rm = TRUE), as.double(.))))

I've also tried adding rowwise() to the piping chain but still nothing. I know it has to do with the code not really taking the mean across rows but I don't know how to change it so it does. Help!

GreenManXY

1 Answer

Using the arr.ind argument of which() together with is.na(df) and rowMeans(), you can do this quite easily in base R:

i <- which(is.na(df), arr.ind = TRUE)
df[i] <- rowMeans(df[,-1], na.rm = TRUE)[i[,1]]

which gives:

> df
  ID Price1 Price2 Price3   Price4
1  1    2.1      3    4.0 3.033333
2  2    2.0      3    4.5 3.166667
3  3    2.0      3    4.0 3.000000
4  4    3.5      3    4.0 3.500000
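To reproduce this, the data from the question can be rebuilt by hand. A minimal sketch, assuming the table in the question is a plain data.frame named df with ID in the first column (the construction below is typed in from the question's table and is my assumption about how the data is stored):

```r
# Data typed in from the question's table; the name df is an assumption
df <- data.frame(
  ID     = 1:4,
  Price1 = c(2.1, 2, 2, NA),
  Price2 = c(3, 3, NA, 3),
  Price3 = c(4, 4.5, 4, 4),
  Price4 = NA_real_          # recycled: the whole column is NA
)

# Row/column positions of every NA in the data frame
i <- which(is.na(df), arr.ind = TRUE)

# Row means over everything except ID, ignoring NAs, picked out once per
# NA position and written back into exactly those positions
df[i] <- rowMeans(df[, -1], na.rm = TRUE)[i[, 1]]

df
```

Note that the right-hand side is evaluated before the assignment, so the row means are computed from the original values only.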

What this does:

With which(is.na(df), arr.ind = TRUE) you get a matrix with the row and column numbers of every NA value:

> which(is.na(df), arr.ind = TRUE)
     row col
[1,]   4   2
[2,]   3   3
[3,]   1   5
[4,]   2   5
[5,]   3   5
[6,]   4   5

With rowMeans(df[,-1], na.rm = TRUE) you get a vector of the means by row:

> rowMeans(df[,-1], na.rm = TRUE)
[1] 3.033333 3.166667 3.000000 3.500000

By indexing that with the row column of the array index (i[,1]), you get a vector that is as long as the number of NA values in the dataframe:

> rowMeans(df[,-1], na.rm = TRUE)[i[,1]]
[1] 3.500000 3.000000 3.033333 3.166667 3.000000 3.500000

By indexing the dataframe df with the array-index, you tell R at which spots to put those values.
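If the ID column is not reliably the first column, the same steps can be limited to the price columns by name. The sketch below wraps them in a helper; fill_na_with_row_mean is a hypothetical name of my own, not a base R or dplyr function, and the data is again typed in from the question's table:

```r
# Hypothetical helper (not part of base R or dplyr): fill the NAs in the
# columns named in `cols` with the row mean of those same columns
fill_na_with_row_mean <- function(data, cols) {
  m <- as.matrix(data[cols])                  # numeric matrix of the chosen columns
  i <- which(is.na(m), arr.ind = TRUE)        # row/column positions of the NAs
  m[i] <- rowMeans(m, na.rm = TRUE)[i[, 1]]   # one row mean per NA position
  data[cols] <- as.data.frame(m)              # write the filled columns back
  data
}

# Data typed in from the question's table; the name df is an assumption
df <- data.frame(
  ID     = 1:4,
  Price1 = c(2.1, 2, 2, NA),
  Price2 = c(3, 3, NA, 3),
  Price3 = c(4, 4.5, 4, 4),
  Price4 = NA_real_
)

price_cols <- c("Price1", "Price2", "Price3", "Price4")
filled <- fill_na_with_row_mean(df, price_cols)
filled
```

One caveat: a row in which every selected column is NA has no values to average, so rowMeans returns NaN for it and the NAs in that row become NaN.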

Jaap
  • Thanks, this worked! I also saw that this question had already been asked, but I couldn't find it because I was focused on a dplyr solution. – GreenManXY Jun 02 '17 at 18:05
  • @GreenManXY Glad I could help. `dplyr` is focussed on solving particular tasks. The *tidyverse* can be seen as an add-on to base R; nothing more, nothing less. Having (extensive) knowledge of base R functions can be really helpful ;-) – Jaap Jun 04 '17 at 11:18