How can I replace NA's with mean values for numeric columns and with mode values for character columns by group?

Question

I have an automobile data set (auto_data) with NA values across 6 columns.

auto_data$normalized.losses = replace_na(auto_data$normalized.losses,mean(auto_data$normalized.losses, na.rm = TRUE))

auto_data$num.of.doors = replace_na(auto_data$num.of.doors,mode(auto_data$num.of.doors, na.rm = TRUE))

for (i in 1:ncol(auto_data)) {
  if ((is.numeric(i)) & (is.na(i)))
  {
    replace_na(i,mean(i, na.rm = TRUE))
  }
  
}

This is as far as I have gotten. However, for num.of.doors (character, either 'two' or 'four'), the replaced NAs read 'character' instead of 'two' or 'four', and the for loop does not change anything.

I would also like the mode/mean to be grouped by make and body.style, but figured I should first work through the preliminary step of getting the means and modes set up. I have experimented with wrapping replace_na() in a group-by call.

auto_data table

Data source: https://www.kaggle.com/datasets/toramky/automobile-dataset?resource=download

make = c("alfa-romero", "alfa-romero", "alfa-romero", "audi", "audi", "audi", "audi")
symboling = c(3L, 3L, 1L, 2L, 2L, 2L, 1L)
normalized.losses = c(NA, NA, NA, 164L, 164L, NA, 158L)
fuel.type = c("gas", "gas", "gas", "gas", "gas", "gas", "gas")
aspiration = c("std", "std", "std", "std", "std", "std", "std")
num.of.doors = c("two", "two", "two", "four", "four", "two", "four")
body.style = c("convertible", "convertible", "hatchback", "sedan", "sedan", "sedan", "sedan")
price = c(13495, 18705, NA, 17217, 17293, NA, 18304)

auto_data_sample = data.frame(make, symboling, fuel.type, aspiration, num.of.doors, body.style, price)

Hello Abbi, next time please provide a minimal reproducible example. Does this thread help? https://stackoverflow.com/questions/18996156/cell-mean-imputation — fbeese, Jun 30 '22 at 20:50
The loop manipulates the index variable `i` of the loop and not the data of the data frame. `i` takes each value of `1:n` in turn and is always numeric. To get the column you have to access it by index: `auto_data[[i]]`. — Jan, Jun 30 '22 at 20:50
@fbeese yes a part of that did help me revise my for loop body to: auto_data[,i][is.na(auto_data[,i])] = mean(auto_data[,i], na.rm=TRUE). Which worked, thanks! — Abbi Asseged, Jun 30 '22 at 22:43

Gregor Thomas · Answer 1 · 2022-07-01T10:40:38.957

Here's an idea using dplyr:

library(dplyr)
library(tidyr)  # provides replace_na()
auto_data_sample %>%
  group_by(make, body.style) %>%
  mutate(
    across(where(is.numeric), ~replace_na(., replace = mean(., na.rm = TRUE))),
    across(where(is.character), ~replace_na(., replace = Mode(., na.rm = TRUE)))
  )
# # A tibble: 7 × 7
# # Groups:   make, body.style [3]
#   make        symboling fuel.type aspiration num.of.doors body.style   price
#   <chr>           <dbl> <chr>     <chr>      <chr>        <chr>        <dbl>
# 1 alfa-romero         3 gas       std        two          convertible 13495 
# 2 alfa-romero         3 gas       std        two          convertible 18705 
# 3 alfa-romero         1 gas       std        two          hatchback     NaN 
# 4 audi                2 gas       std        four         sedan       17217 
# 5 audi                2 gas       std        four         sedan       17293 
# 6 audi                2 gas       std        two          sedan       17605.
# 7 audi                1 gas       std        four         sedan       18304

Here Mode is the function from this answer; define it before running the pipeline above.

Mode <- function(x, na.rm = FALSE) {
  if(na.rm){
    x = x[!is.na(x)]
  }

  ux <- unique(x)
  return(ux[which.max(tabulate(match(x, ux)))])
}

Do note that the mode is not necessarily unique. This implementation will pick one if there are multiple modes: the one that occurs first in the data, because unique() keeps values in order of first appearance and which.max() returns the index of the first maximum.
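The tie behaviour can be checked directly. The snippet below is only a sketch: it repeats Mode() so that it runs on its own, and the vector `tied` is made-up data in which "two" and "four" each occur twice, with "four" appearing first.

```r
# Mode() as defined above, repeated so this snippet is self-contained
Mode <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  ux <- unique(x)                          # values in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]    # which.max() returns the first maximum
}

# Hypothetical data: a two-way tie, "four" seen first
tied <- c("four", "two", NA, "two", "four")
first_mode <- Mode(tied, na.rm = TRUE)
first_mode  # "four"
```

Reversing the order of `tied` would return "two" instead, so the result depends on row order whenever there is a tie.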

If you need more help, please provide some sample data.
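One more detail in the output above: row 3 has price NaN because every price in the alfa-romero/hatchback group is NA, so mean(., na.rm = TRUE) has nothing to average. If that matters, one possible fallback (an assumption on my part, not something the question asks for) is to fill whatever is still missing with the overall column mean. A minimal sketch, where `prices`, `overall_mean` and `filled` are illustrative names:

```r
library(dplyr)
library(tidyr)

# Hypothetical cut-down version of the sample data: grouping columns and price only
prices <- data.frame(
  make = c("alfa-romero", "alfa-romero", "alfa-romero", "audi", "audi", "audi", "audi"),
  body.style = c("convertible", "convertible", "hatchback", "sedan", "sedan", "sedan", "sedan"),
  price = c(13495, 18705, NA, 17217, 17293, NA, 18304)
)

# Overall mean of the observed prices, computed before any imputation
overall_mean <- mean(prices$price, na.rm = TRUE)

filled <- prices %>%
  group_by(make, body.style) %>%
  # Group mean first; this is NaN when the whole group is NA
  mutate(price = replace_na(price, mean(price, na.rm = TRUE))) %>%
  ungroup() %>%
  # is.na() is TRUE for NaN as well, so leftover gaps get the overall mean
  mutate(price = if_else(is.na(price), overall_mean, price))
```

Whether the overall mean is a sensible substitute depends on the data; grouping by make alone for the second pass would be another option.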

I have tried to add some sample code for you; I hope it is enough. I am struggling to properly subset this large data set but will have a more thorough code sample shortly. Is the 'x' in your dplyr solution supposed to be replaced with a specific column? It returns an error: Caused by error in `across()`: ! Problem while computing column `symboling`. Caused by error in `mean()`: ! object 'x' not found. When replaced with auto_data, it returns multiple warnings with no change in the data. — Abbi Asseged, Jun 30 '22 at 21:52
Updated - thanks for the sample data, that helped me debug it. — Gregor Thomas, Jul 01 '22 at 10:42
