How to use which.max in R accounting for groups to impute missing values?

Question

I have the following dataframe (where both are factor variables):

     A   B
1 pine NA
2  fig 234
3  fig 234
4  fig 145
5 pine 123
6  fig NA

I'm wanting to replace missing values in B with the most frequently occurring value within group A in the dataset. The above would become:

     A   B
1 pine 123
2  fig 234
3  fig 234
4  fig 145
5 pine 123
6  fig 234

I've found the code below on a similar question which makes use of the which.max function, but can't seem to get it work group wise, with it returning the overall max of column B for every NA value, instead of the max within each group of A.

df2 <- df1 %>%
  group_by(A) %>%
  add_count(B) %>%
  mutate(B = if_else(is.na(B), B[which.max(n)], B)) %>%
  select(-n) %>%
  ungroup()

Do I need an extra group by somewhere?

If both of them are factor, either create a new level in that column before `replace`ment or convert to `character` class — akrun, Jul 23 '20 at 22:20

score 1 · Answer 1 · answered Jul 23 '20 at 22:22

We can use Mode function from here. After grouping by 'A', use the na.aggregate from zoo specifying the FUN as Mode to replace the NA elements with the Mode of that column

library(dplyr)
library(zoo)
df1 %>%
      group_by(A) %>%
      mutate(B = na.aggregate(B, FUN = Mode))
# A tibble: 6 x 2
# Groups:   A [2]
#  A         B
#  <chr> <int>
#1 pine    123
#2 fig     234
#3 fig     234
#4 fig     145
#5 pine    123
#6 fig     234

where

Mode <- Mode <- function(x) {
   ux <- unique(x)
   ux[which.max(tabulate(match(x, ux)))]
 }

data

df1 <- structure(list(A = c("pine", "fig", "fig", "fig", "pine", "fig"
), B = c(NA, 234L, 234L, 145L, 123L, NA)), class = "data.frame", 
row.names = c("1", 
"2", "3", "4", "5", "6"))

Thank you. I can see that works with the test data but not with my actual dataset, so there's something going on with my data that I don't realise. I also tried converting columns to character and numeric types but no luck. — Hulleo, Jul 23 '20 at 22:50
@Hulleo I showed an example with `dput` to make it reproducible — akrun, Jul 23 '20 at 22:51

score 0 · Answer 2 · answered Jul 23 '20 at 22:28

You can try(Maybe not the most elegant):

library(tidyverse)

#Data
df <- structure(list(A = structure(c(2L, 1L, 1L, 1L, 2L, 1L), .Label = c("fig", 
"pine"), class = "factor"), B = c(NA, 234L, 234L, 145L, 123L, 
NA)), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6"))

#Code
df %>% left_join(df %>% filter(!is.na(B)) %>% group_by(A,B) %>% summarise(N=n()) %>% ungroup() %>%
  group_by(A) %>% filter(N==max(N)) %>% rename(Fill=B)) %>% group_by(A) %>%
  mutate(B=if_else(is.na(B),Fill,B)) %>% select(-c(Fill,N))

# A tibble: 6 x 2
# Groups:   A [2]
  A         B
  <fct> <int>
1 pine    123
2 fig     234
3 fig     234
4 fig     145
5 pine    123
6 fig     234

How to use which.max in R accounting for groups to impute missing values?

2 Answers2

data