1

I have a large dataset with some missing values (NAs). I'm looking to replace these values with the column means but by class, that is, where items in class k have a missing value in column j, that value will be replaced by the mean of values in column J for items in class k. Additionally, I want to do this with only base R or dplyr.

The class aspect brings an additional problem compared to the well-known one that's already been answered here: Replace missing values with column mean.

In fact I can adapt one of the solutions there into a clumsy solution for my problem:

NA2mean <- function(x){replace(x, is.na(x), mean(x, na.rm = TRUE))}
DF %>% filter(DF$class=="A") -> A
A <- lapply(A,NA2mean)

(where the dataframe is DF and I have assumed the factor is stored in the column 'class'.)

Then you'd repeat this for every other class (e.g. B, C, D, E, F). Finally you could use DF <- rbind(A,B,C,D,E,F) to replace your old dataframe with the corrected one.

The dataframe in my case is ordered by class (i.e. A first, then B, then C, ...) and I'd like to keep it that way.

Any way of doing this much more efficiently?

Mobeus Zoom
  • 598
  • 5
  • 19

3 Answers3

1

Base R Solution:

df[, sapply(df, is.numeric)] <-
  do.call("rbind", lapply(split(df[, sapply(df, is.numeric)], df$class), function(x) {
    x <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
  }))
hello_friend
  • 5,682
  • 1
  • 11
  • 15
1

We can use na.aggregate from zoo

library(dplyr)
library(zoo)
DF %>%
  group_by(class) %>%
  mutate_at(vars(-group_cols()), na.aggregate)

If we need base R

nm1 <- setdiff(names(DF), "class")
DF[nm1] <- lapply(DF[nm1], function(vec) ave(vec, class, FUN = NA2mean))
akrun
  • 874,273
  • 37
  • 540
  • 662
0

Using dplyr, you could group_by Class and apply NA2mean for every column.

library(dplyr)
DF %>% group_by(class) %>% mutate_all(NA2mean)

In the newer version of dplyr, you can do this across

DF %>% group_by(class) %>% mutate(across(everything(), NA2mean))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213