In R how do I replace the missing values with the column mean?

Question

I have missing data that I want to replace with the column means. Could anyone provide the R command to do this? This topic has come up on the site before, but the instructions weren't general enough for me to complete the command. Any help would be greatly appreciated.

Hello and welcome to SO. How are your missing data represented: as `NA`, as `""`, or something else? To help make a reproducible example, you can use `reproduce()`. Instructions are here: [How to make a great R reproducible example](http://bit.ly/SORepro) — Ricardo Saporta, Jan 16 '14 at 16:43
Your question is missing information without which it can't be answered. Please follow the instructions in the link provided by @RicardoSaporta. — Roland, Jan 16 '14 at 16:44
Replacing missing values with column means is generally regarded as a _really_ bad idea. If you go on to do any modelling with the data, the standard errors you obtain will be artificially small because of this. — Stuples, Jan 16 '14 at 16:52
We imported our data as csv from an Excel file where the missing data was blank. We think the missing data is represented by `NA`. The data doesn't belong to us. If we use the reproduce command, will it upload the data? We don't want to upload data that doesn't belong to us. — user3203431, Jan 16 '14 at 16:54
@user3203431 Don't upload the real stuff then. Make up your own minimal sample. In fact, pasting a large real data set is usually a bad idea, as it unnecessarily complicates things. So something like: `x <- data.frame(matrix(rnorm(30), ncol=3)); x[c(4, 9), c(1, 2, 3)] <- NA; x` would produce a minimal working example. — Tyler Rinker, Jan 16 '14 at 17:02
Is Tyler's answer of [this question](http://stackoverflow.com/questions/9322773/how-to-replace-na-with-mean-by-subset-in-r-impute-with-plyr) of any help? — marbel, Jan 16 '14 at 17:07
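
The point raised in the comments about artificially small standard errors can be seen in a small simulation. This is only a sketch with made-up data: the seed, sample size, and number of missing values are arbitrary assumptions and do not come from the thread.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
y <- rnorm(100, mean = 50, sd = 10)   # made-up data
y[sample(100, 30)] <- NA              # set 30 values to missing at random

# Mean imputation: every NA becomes the mean of the observed values
y_imp <- replace(y, is.na(y), mean(y, na.rm = TRUE))

sd(y, na.rm = TRUE)   # spread of the observed values
sd(y_imp)             # smaller, because the imputed points sit exactly at the mean
```

The imputed values add nothing to the sum of squared deviations but do increase the sample size, so the estimated spread (and any standard error built on it) shrinks.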

marbel · Answer 1 · 2014-01-16T18:13:21.320

As there are not many details in your question, this is what I imagined your problem could be. If your data happens to be in wide format, use `require(reshape2); melt(yourdata)` to convert it to long format. Edit: Added wide and long format examples. I don't have a ddply way to solve this in the wide format case. Please edit to add one.

require(data.table)
require(plyr)

Long Format

set.seed(123)
# One million rows: a grouping letter and a count variable containing NAs
df <- data.frame(group = sample(letters[1:5], 1e6, replace = TRUE),
                 q_var = sample(c(rpois(10, 50), NA), 1e6, replace = TRUE))
DT <- data.table(df)

impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Impute by group
imp1 <- ddply(df, ~ group, transform, q_var = impute.mean(q_var))

table(df$group)
length(df$group)

imp2 <- DT[, lapply(.SD, impute.mean), by = "group"]
table(DT$group)
length(DT$group)

require(rbenchmark)

imp_ddply <- function(x){
  # Use the argument rather than the global `df`
  ddply(x, ~ group, transform, q_var = impute.mean(q_var))
}

imp_DT <- function(x){
  # Use the argument rather than the global `DT`
  x[, lapply(.SD, impute.mean), by = "group"]
}

benchmark(imp_ddply(df), imp_DT(DT))
#          test replications elapsed relative user.self sys.self 
# imp_ddply(df)          100  156.47   13.419    149.94     6.35 
#    imp_DT(DT)          100   11.66    1.000     11.61     0.04
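
For comparison, the same group-wise imputation can be sketched in base R with `ave()`, without plyr or data.table. The `small` data frame below is a hypothetical toy example, not part of the original answer, and this variant was not included in the benchmark above.

```r
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Hypothetical toy data: two groups, one NA in each
small <- data.frame(group = c("a", "a", "a", "b", "b", "b"),
                    q_var = c(1, NA, 3, 10, 20, NA))

# ave() applies FUN within each level of `group` and returns
# a vector in the original row order
small$q_var <- ave(small$q_var, small$group, FUN = impute.mean)
small$q_var
# [1]  1  2  3 10 20 15
```

Each `NA` is replaced by the mean of its own group (2 for group `a`, 15 for group `b`), not by the overall column mean.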

Wide Format

require(reshape2)

wdf <- data.frame(matrix(sample(c(rpois(10, 50), NA), 900000, replace = TRUE), ncol = 3))
WDT <- data.table(wdf)

wide_imp1 <- apply(wdf, 2, impute.mean)
wide_imp2 <- WDT[, lapply(.SD, impute.mean)]

# Use the argument rather than the globals `wdf` and `WDT`
wide_apply <- function(x) apply(x, 2, impute.mean)
wide_DT <- function(x) x[, lapply(.SD, impute.mean)]

benchmark(wide_apply(wdf), wide_DT(WDT))
#             test replications elapsed relative user.self sys.self
#  wide_apply(wdf)          100    7.84    1.413      7.84        0
#     wide_DT(WDT)          100    5.55    1.000      5.55        0
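
Note that `apply()` returns a matrix rather than a data frame. If the result should stay a data frame, one base R option is to assign into `x[]` with `lapply()`. A minimal sketch on a hypothetical toy data frame (`small_wide` is not from the original answer, and this variant is not part of the benchmark above):

```r
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Hypothetical toy data: one NA per column
small_wide <- data.frame(X1 = c(1, NA, 3), X2 = c(NA, 4, 6))

# Assigning into small_wide[] keeps the data.frame class and column names
small_wide[] <- lapply(small_wide, impute.mean)
small_wide
#   X1 X2
# 1  1  5
# 2  2  4
# 3  3  6
```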

score 1 · Answer 2 · edited Mar 03 '14 at 17:48

Using Tyler's data from above

x[is.na(x$X1), 1] <- mean(x$X1, na.rm = TRUE)

answered Jan 16 '14 at 17:55 by Randall · edited Mar 03 '14 at 17:48 by Thomas

what about the other columns? – marbel Jan 16 '14 at 18:03
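
One way to cover the other columns is to loop over every numeric column and apply the same replacement. A minimal sketch, using the example data from Tyler's comment; the `set.seed(1)` call is an addition for reproducibility and is not in the original.

```r
set.seed(1)  # assumption: added only so the example is reproducible
x <- data.frame(matrix(rnorm(30), ncol = 3))
x[c(4, 9), c(1, 2, 3)] <- NA

# Replace the NAs in each numeric column with that column's mean
for (col in names(x)) {
  if (is.numeric(x[[col]])) {
    x[is.na(x[[col]]), col] <- mean(x[[col]], na.rm = TRUE)
  }
}

anyNA(x)
# [1] FALSE
```

The `is.numeric()` check skips character or factor columns, for which a mean is not defined.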
