Replace NA with mean based on row subset matching another column?

Question

I have data where each row contains a person's gender and weight (in lbs):

genders <- c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE", "MALE", "MALE")
weights <- c(110.0, 120.0, 112.0, NA, NA, 190.0, 202.0, 195.0, NA)

df <- data.frame(gender=genders, weight=weights)
df
#   gender weight
# 1 FEMALE    110
# 2 FEMALE    120
# 3 FEMALE    112
# 4 FEMALE     NA
# 5 FEMALE     NA
# 6   MALE    190
# 7   MALE    202
# 8   MALE    195
# 9   MALE     NA

For every row that has an NA in the weight column, I would like to replace / impute the NA with the weight mean, but the mean should be calculated using only the rows that match the same gender value as the row with the NA.

Specifically, rows 4 and 5 have a gender of FEMALE and weight of NA. I want to replace the NA with the mean weight computed over the subset of rows that match the gender of FEMALE. In this case, the mean would be (110+120+112)/3=114.0 from the other rows 1, 2, and 3.

Likewise, I want to replace the NA in row 9 with the mean of the weights for MALE gender.

I tried the following command, but it replaced the NA with the mean weight over all users across both genders, which is not what I want.

df$weight[is.na(df$weight)] <- mean(subset(df, gender=df$gender)$weight, na.rm=T)
df
#   gender   weight
# 1 FEMALE 110.0000
# 2 FEMALE 120.0000
# 3 FEMALE 112.0000
# 4 FEMALE 154.8333
# 5 FEMALE 154.8333
# 6   MALE 190.0000
# 7   MALE 202.0000
# 8   MALE 195.0000
# 9   MALE 154.8333

I searched other questions, but they are not quite the same problem as mine:

"Replace NA with mean matching the same ID"

"How to replace NA with mean by subset in R (impute with plyr?)"

"How to replace NA values in a table for selected columns? data.frame, data.table"

score 7 · Accepted Answer · answered Jul 31 '16 at 19:42

7

You could use ave() with replace() (or standard manual replacement).

df$weight <- with(df, ave(weight, gender,
    FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))))

which gives

  gender   weight
1 FEMALE 110.0000
2 FEMALE 120.0000
3 FEMALE 112.0000
4 FEMALE 114.0000
5 FEMALE 114.0000
6   MALE 190.0000
7   MALE 202.0000
8   MALE 195.0000
9   MALE 195.6667

answered Jul 31 '16 at 19:42

Rich Scriven

97,041
11
181
245

Thank you. Simple answer without extra packages. Just what I was looking for. That `ave()` function looks very powerful. – stackoverflowuser2010 Jul 31 '16 at 19:45

score 5 · Answer 2 · answered Jul 31 '16 at 19:37

5

You can group your data frame by gender and then calculate the average of weight and replace the NA with ifelse statement, in dplyr, it could be:

library(dplyr)
df %>% 
      group_by(gender) %>% 
      mutate(weight = ifelse(is.na(weight), mean(weight, na.rm = T), weight))

# Source: local data frame [9 x 2]
# Groups: gender [2]

#  gender   weight
#  <fctr>    <dbl>
# 1 FEMALE 110.0000
# 2 FEMALE 120.0000
# 3 FEMALE 112.0000
# 4 FEMALE 114.0000
# 5 FEMALE 114.0000
# 6   MALE 190.0000
# 7   MALE 202.0000
# 8   MALE 195.0000
# 9   MALE 195.6667

answered Jul 31 '16 at 19:37

Psidom

209,562
33
339
356

The new `coalesce` function fits well here. – alistaire Jul 31 '16 at 23:11
@alistaire Looks very useful and handy. – Psidom Aug 01 '16 at 00:29

score 2 · Answer 3 · answered Jul 31 '16 at 19:38

2

Using base R this appears to be what you're looking for:

df$weight[df$gender=="FEMALE" & is.na(df$weight)] <- mean(df$weight[df$gender=="FEMALE"], na.rm=TRUE)
df$weight[df$gender=="MALE" & is.na(df$weight)] <- mean(df$weight[df$gender=="MALE"], na.rm=TRUE)

> df
  gender   weight
1 FEMALE 110.0000
2 FEMALE 120.0000
3 FEMALE 112.0000
4 FEMALE 114.0000
5 FEMALE 114.0000
6   MALE 190.0000
7   MALE 202.0000
8   MALE 195.0000
9   MALE 195.6667

answered Jul 31 '16 at 19:38

Warner

1,353
9
23

This is very manual approach. How would they use is for more than two groups? See in comments a more generalized approach – David Arenburg Jul 31 '16 at 19:40
Is there a way to do this without hard-coding "FEMALE" and "MALE"? The data in a column could have dozens of unique values. – stackoverflowuser2010 Jul 31 '16 at 19:41
@DavidArenburg good point. It looks like Richard Scriven's approach is better and works in the case where there's several unique values. – Warner Jul 31 '16 at 19:47

score 1 · Answer 4 · answered Aug 01 '16 at 02:59

This can be easily done using na.aggregate from zoo. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'gender', we apply the na.aggregate to 'weight' to replace the NA elements with the mean value. By default, the na.aggregate return the mean, but we can also change the FUN argument to get median or sum etc.

library(data.table)
library(zoo)
setDT(df)[, weight := na.aggregate(weight) , by = gender]

Or with ave from base R

with(df, ave(weight, gender, FUN = na.aggregate))
#[1] 110.0000 120.0000 112.0000 114.0000 114.0000 190.0000 202.0000 195.0000 195.6667

Replace NA with mean based on row subset matching another column?

4 Answers4