Dealing with missing values when calculating weighted mean

Question

Let's say I have the following dataframe:

df <- as.data.frame(rbind(c(5, NA, NA, 1, NA, NA),
                         c(NA, 2, 2, NA, 0.5, 0.5), 
                         c(NA, NA, NA, NA, NA, NA),
                         c(1, 1, 1, 0.33, 0.33, 0.33)))
colnames(df) <- c("V1", "V2", "V3", "W1", "W2", "W3")

I would like to add a row-wise weighted mean column to the dataframe, ignoring the NA values where they occur. For example, in the first row, only V1 and W1 would be used to calculate the weighted mean.

My final dataframe would look like this:

  V1 V2 V3   W1   W2   W3 Wmean
1  5 NA NA 1.00   NA   NA     5
2 NA  2  2   NA 0.50 0.50     2
3 NA NA NA   NA   NA   NA    NA
4  1  1  1 0.33 0.33 0.33     1

Note that the weighted mean is calculated as follows: Wmean = (V1*W1 + V2*W2 + V3*W3)/(W1 + W2 + W3), using only the terms that are not NA.

Jaap · Answer 1 · 2018-07-05T08:31:19.207

Using:

df$Wmean <- rowSums(df[1:3] * df[4:6], na.rm = TRUE) / rowSums(df[4:6], na.rm = TRUE)

gives:

> df
  V1 V2 V3   W1   W2   W3 Wmean
1  5 NA NA 1.00   NA   NA     5
2 NA  2  2   NA 0.50 0.50     2
3 NA NA NA   NA   NA   NA   NaN
4  1  1  1 0.33 0.33 0.33     1
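
Row 3 comes out as NaN rather than the NA shown in the question, because both row sums are 0 for that row and 0/0 is NaN in R. If NA is preferred, a minimal sketch (rebuilding the question's data so that it runs on its own) could look like this:

```r
# Rebuild the example data from the question
df <- data.frame(V1 = c(5, NA, NA, 1),
                 V2 = c(NA, 2, NA, 1),
                 V3 = c(NA, 2, NA, 1),
                 W1 = c(1, NA, NA, 0.33),
                 W2 = c(NA, 0.5, NA, 0.33),
                 W3 = c(NA, 0.5, NA, 0.33))

# Same calculation as above
df$Wmean <- rowSums(df[1:3] * df[4:6], na.rm = TRUE) /
  rowSums(df[4:6], na.rm = TRUE)

# Rows without any usable value/weight pair give 0/0 = NaN; recode them as NA
df$Wmean[is.nan(df$Wmean)] <- NA
```

is.nan() is used instead of is.na() so that only the 0/0 results are touched.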

If you don't know the positions of the V and W columns beforehand, you can look them up by name (as @Ronak does with startsWith):

Vi <- grep("^V", names(df))
Wi <- grep("^W", names(df))

df$Wmean <- rowSums(df[Vi] * df[Wi], na.rm = TRUE) / rowSums(df[Wi], na.rm = TRUE)
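
Two caveats may be worth guarding against here. First, the pattern "^W" also matches the new Wmean column once it exists, so re-running the lookup would treat it as a weight. Second, the rowSums formula keeps a weight in the denominator even when its value is missing (not an issue in the example data, where the NAs line up). The sketch below is one possible, more defensive variant. It assumes the columns are named V or W followed by digits and that they pair up in order; the names V and W for the intermediate matrices are just illustrative.

```r
# Rebuild the example data from the question
df <- data.frame(V1 = c(5, NA, NA, 1),
                 V2 = c(NA, 2, NA, 1),
                 V3 = c(NA, 2, NA, 1),
                 W1 = c(1, NA, NA, 0.33),
                 W2 = c(NA, 0.5, NA, 0.33),
                 W3 = c(NA, 0.5, NA, 0.33))

# Anchored patterns: match V1, V2, ... and W1, W2, ... but not "Wmean"
Vi <- grep("^V[0-9]+$", names(df))
Wi <- grep("^W[0-9]+$", names(df))
stopifnot(length(Vi) == length(Wi))  # assumes V and W columns pair up in order

# Work on plain numeric matrices
V <- as.matrix(df[Vi])
W <- as.matrix(df[Wi])

# Blank out weights whose value is missing so they don't inflate the denominator
W[is.na(V)] <- NA

df$Wmean <- rowSums(V * W, na.rm = TRUE) / rowSums(W, na.rm = TRUE)
```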

score 2 · Accepted Answer · answered Jul 05 '18 at 08:25

We can gather the indices of the columns starting with "V" and "W" separately and then apply weighted.mean to each row, ignoring the NA values.

V_index <- startsWith(names(df), "V")
W_index <- startsWith(names(df), "W")
df$WMean <- apply(df, 1, function(x) 
                  weighted.mean(x[V_index], x[W_index], na.rm = TRUE))

df

#  V1 V2 V3   W1   W2   W3 WMean
#1  5 NA NA 1.00   NA   NA     5
#2 NA  2  2   NA 0.50 0.50     2
#3 NA NA NA   NA   NA   NA   NaN
#4  1  1  1 0.33 0.33 0.33     1
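
One thing to be aware of with weighted.mean: na.rm = TRUE only drops the entries whose value is NA (together with their weights). An NA weight attached to a non-missing value still makes the result NA, as noted in ?weighted.mean. A small illustration with made-up numbers (a, b, cc, x, w and ok are just illustrative names):

```r
# Value and weight missing in the same position: the pair is dropped
a <- weighted.mean(c(NA, 2, 2), c(NA, 0.5, 0.5), na.rm = TRUE)

# Weight missing but value present: na.rm does not help here
b <- weighted.mean(c(1, 2), c(NA, 1), na.rm = TRUE)

# One possible workaround: keep only complete value/weight pairs
x <- c(1, 2)
w <- c(NA, 1)
ok <- !is.na(x) & !is.na(w)
cc <- weighted.mean(x[ok], w[ok])

print(c(a = a, b = b, cc = cc))
```

In the question's data the NAs always occur in matching V/W positions, so the apply call above is not affected.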

score 2 · Answer 3 · answered Jul 05 '18 at 08:28

You can also replace the NAs with 0 and apply the formula directly. Note that this overwrites the NAs in df itself:

df[is.na(df)] <- 0
df$mean <- ((df$V1 * df$W1) + (df$V2 * df$W2) + (df$V3 * df$W3)) / (df$W1 + df$W2 + df$W3)

Output:

  V1 V2 V3   W1   W2   W3 mean
1  5  0  0 1.00 0.00 0.00    5
2  0  2  2 0.00 0.50 0.50    2
3  0  0  0 0.00 0.00 0.00  NaN
4  1  1  1 0.33 0.33 0.33    1
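
If the original NAs should be kept in df, one option is to do the zero-filling on a temporary copy and only write the result back. A minimal sketch, where tmp is just an illustrative name:

```r
# Rebuild the example data from the question
df <- data.frame(V1 = c(5, NA, NA, 1),
                 V2 = c(NA, 2, NA, 1),
                 V3 = c(NA, 2, NA, 1),
                 W1 = c(1, NA, NA, 0.33),
                 W2 = c(NA, 0.5, NA, 0.33),
                 W3 = c(NA, 0.5, NA, 0.33))

# Zero-fill a copy so that df keeps its NAs
tmp <- df
tmp[is.na(tmp)] <- 0

# Same formula as above, computed on the copy and stored in df
df$mean <- (tmp$V1 * tmp$W1 + tmp$V2 * tmp$W2 + tmp$V3 * tmp$W3) /
  (tmp$W1 + tmp$W2 + tmp$W3)
```

As with the other answers, a row with no data at all still yields NaN because of the 0/0 division.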
