2

I am working on a data frame with multiple data types.I would like to replace NA values only on numerical columns with the median of that particular column.I have seen questions on replacing with mean a lot, but not median. My df is similar to the following code:

my_groups <- c(rep("A", 5), rep("B",5))
my_values_1 <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_values_2 <- c(3, NA, 4, 8, 2, 11, 15, NA, 9, 10)
my_df <- data.frame(my_groups, my_values_1, my_values_2)
my_df %>% select_if(is.numeric)

This gives me numerical columns, but I cant figure out the next step.

Tommy
  • 39
  • 5

3 Answers3

5

Here are several approaches. The test data frame DF is defined in (1) and used in the other approaches as well.

1) dplyr - across/coalesce

library(dplyr)

# test data
DF <- data.frame(a = c(NA, NA, 1, 2), b = 1:4, c = letters[1:4])

DF %>% 
  mutate(across(where(is.numeric), ~ coalesce(., median(., na.rm = TRUE))))

giving:

    a b c
1 1.5 1 a
2 1.5 2 b
3 1.0 3 c
4 2.0 4 d

2) dplyr/tidyr - across/replace_na

library(dplyr)
library(tidyr)

DF %>% 
  mutate(across(where(is.numeric), ~ replace_na(., median(., na.rm = TRUE))))

3) zoo - na.aggregate

library(zoo)

ok <- sapply(DF, is.numeric)
replace(DF, ok, na.aggregate(DF[ok], FUN = median))

4) Base R

na.median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))   
ok <- sapply(DF, is.numeric)
replace(DF, ok, lapply(DF[ok], na.median))

5) Base R - S3

na.median <- function(x, ...) UseMethod("na.median")
na.median.default <- identity
na.median.numeric <- function(x, ...) {
  replace(x, is.na(x), median(x, na.rm = TRUE))   
}

replace(DF, TRUE, lapply(DF, na.median))

6) magrittr We first make a copy of DF to avoid clobbering it -- although not recommend you can just use DF in last line if you are ok with overwriting it -- and use magrittr %<>%. na.median is from (4).

library(magrittr)

DF2 <- DF
DF2[sapply(DF2, is.numeric)] %<>% lapply(na.median)

7) collapse - ftmv ftmv or its synonym ftransformv provide a compact expression. This uses na.median is from (4).

library(collapse)

tfmv(DF, is.numeric, na.median)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
3

We could use mutate with across and an ifelse statement: Note: D. Grothendieck answer works also perfect!

library(dplyr)
my_df %>% 
  mutate(across(where(is.numeric), ~ifelse(is.na(.), median(.,na.rm=TRUE), .)))

output:

   my_groups my_values_1 my_values_2
1          A         4.0         3.0
2          A         9.0         8.5
3          A        10.0         4.0
4          A         8.5         8.0
5          A         5.0         2.0
6          B        12.0        11.0
7          B         8.5        15.0
8          B         7.0         8.5
9          B        11.0         9.0
10         B         8.0        10.0
TarJae
  • 72,363
  • 6
  • 19
  • 66
2

An option with case_when

library(dplyr)
 my_df %>% 
   mutate(across(where(is.numeric), 
     ~ case_when(is.na(.) ~ median(., na.rm = TRUE), TRUE ~ .)))

-output

 my_groups my_values_1 my_values_2
1          A         4.0         3.0
2          A         9.0         8.5
3          A        10.0         4.0
4          A         8.5         8.0
5          A         5.0         2.0
6          B        12.0        11.0
7          B         8.5        15.0
8          B         7.0         8.5
9          B        11.0         9.0
10         B         8.0        10.0
akrun
  • 874,273
  • 37
  • 540
  • 662