I want to replace the NAs in each numeric column with that column's median.

library(data.table)

dt <- data.table(
  name = c("A","B","C","D","E"),
  sex = c("M","F",NA,"F","M"),
  age = c(1,2,3,NA,4),
  height = c(178.1, 162.1, NA, 169.5, 172.3)
)

Extract the numeric columns

num.cols <-  sapply(dt, is.numeric)
num.cols <- names(num.cols)[num.cols]

Check values

median(dt[, age], na.rm = TRUE)    # 2.5
median(dt[, height], na.rm = TRUE) # 170.9

Use lapply to impute the median in each of the num.cols columns

dt[, lapply(.SD, function(value)
  ifelse(is.na(value), median(value, na.rm = TRUE), value)),
  .SDcols = num.cols]

Question: how do I overwrite the columns that contain NAs with the median-imputed columns, using data.table syntax?

Uwe
iboboboru

1 Answer

We can use na.aggregate from zoo with FUN = median to replace the missing values in the columns listed in .SDcols with each column's median, then assign the result back to those same columns by reference with :=.

library(zoo)
dt[, (num.cols) := na.aggregate(.SD, FUN = median), .SDcols = num.cols]
dt
#   name sex age height
#1:    A   M 1.0  178.1
#2:    B   F 2.0  162.1
#3:    C  NA 3.0  170.9
#4:    D   F 2.5  169.5
#5:    E   M 4.0  172.3
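
If adding zoo as a dependency is not desirable, the same update can probably be done with data.table alone. The following is a minimal sketch, not a tested replacement for na.aggregate: impute_median is a hypothetical helper name introduced here for illustration, and the data is rebuilt from the question so the block runs on its own.

```r
library(data.table)

# Same example data as in the question
dt <- data.table(
  name = c("A", "B", "C", "D", "E"),
  sex = c("M", "F", NA, "F", "M"),
  age = c(1, 2, 3, NA, 4),
  height = c(178.1, 162.1, NA, 169.5, 172.3)
)

# Names of the numeric columns
num.cols <- names(dt)[sapply(dt, is.numeric)]

# Hypothetical helper (not part of data.table or zoo):
# replace the NAs in a vector with the median of its non-missing values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Overwrite the selected columns by reference; sex is not in .SDcols,
# so its NA is left untouched
dt[, (num.cols) := lapply(.SD, impute_median), .SDcols = num.cols]
```

The only change from the lapply call in the question is the (num.cols) := on the left-hand side, which assigns the imputed columns back into dt instead of returning a new table.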
akrun