I have missing data that I want to replace with the column means. Could anyone provide the command to do this in R? This topic has come up on the site before, but the instructions weren't general enough for me to complete the command. Any help would be greatly appreciated.
-
Hello and welcome to SO. How are your missing data represented? As `NA`, as `""`, or something else? To help make a reproducible example, you can use `reproduce()`. Instructions are here: [How to make a great R reproducible example](http://bit.ly/SORepro) – Ricardo Saporta Jan 16 '14 at 16:43
-
Your question is missing information without which it can't be answered. Please follow the instructions in the link provided by @RicardoSaporta. – Roland Jan 16 '14 at 16:44
-
Replacing missing values with column means is generally regarded as a _really_ bad idea. If you go on to do any modelling with the data, the standard errors you obtain will be artificially small because of this. – Stuples Jan 16 '14 at 16:52
-
We imported our data as csv from an excel file where the missing data was blank. We think the data is represented by NA. The data doesn't belong to us, if we use the reproduce command will it upload the data? We don't want to upload data that doesn't belong to us. – user3203431 Jan 16 '14 at 16:54
-
@user3203431 Don't upload the real stuff then. Make up your own minimal sample. In fact, pasting a large real data set is usually a bad idea as it unnecessarily complicates things. So something like: `x <- data.frame(matrix(rnorm(30), ncol=3)); x[c(4, 9), c(1, 2, 3)] <- NA; x` would produce a minimal working example. – Tyler Rinker Jan 16 '14 at 17:02
-
Is Tyler's answer to [this question](http://stackoverflow.com/questions/9322773/how-to-replace-na-with-mean-by-subset-in-r-impute-with-plyr) of any help? – marbel Jan 16 '14 at 17:07
2 Answers
As there are not many details in your question, this is what I imagined your problem could be.
Use `require(reshape2); melt(yourdata)` to convert `yourdata` to long format if it happens to be in wide format.
Edit: Added a wide and long format example. I don't have a ddply way to solve this in the wide format case. Please edit to add it.
require(data.table)
require(plyr)
Long Format
set.seed(123)
df <- data.frame(group = sample(letters[1:5], 10e5, replace = TRUE),
                 q_var = sample(c(rpois(10, 50), NA), 10e5, replace = TRUE))
DT <- data.table(df)
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
# Impute by group
imp1 <- ddply(df, ~ group, transform, q_var = impute.mean(q_var))
table(df$group)
length(df$group)
imp2 <- DT[, lapply(.SD, impute.mean), by = "group"]
table(DT$group)
length(DT$group)
require(rbenchmark)
# Each wrapper works on its argument rather than on a global object
imp_ddply <- function(x) {
  ddply(x, ~ group, transform, q_var = impute.mean(q_var))
}
imp_DT <- function(x) {
  x[, lapply(.SD, impute.mean), by = "group"]
}
benchmark(imp_ddply(df), imp_DT(DT))
#          test replications elapsed relative user.self sys.self
# imp_ddply(df)          100  156.47   13.419    149.94     6.35
#    imp_DT(DT)          100   11.66    1.000     11.61     0.04
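To see what `impute.mean` does on a small scale, here is a minimal sketch on made-up data. The data frame `toy` and its values are assumptions for illustration only, and base R's `ave()` stands in for `ddply` and `data.table` so that no extra packages are needed.

```r
# impute.mean as defined in the answer above
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Hypothetical toy data: two groups, one missing value in each
toy <- data.frame(group = c("a", "a", "a", "b", "b", "b"),
                  q_var = c(1, NA, 3, 10, 20, NA))

# Overall imputation: every NA becomes the mean of all observed values
overall <- impute.mean(toy$q_var)

# Group-wise imputation: each NA becomes the mean of its own group
toy$q_imp <- ave(toy$q_var, toy$group, FUN = impute.mean)
toy
```

In this toy case the overall mean is 8.5, while the group-wise version fills in 2 for group `a` and 15 for group `b`, which is the difference between imputing by column and imputing by group.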
Wide Format
require(reshape2)
wdf <- data.frame(matrix(sample(c(rpois(10, 50), NA), 900000, replace = TRUE), ncol = 3))
WDT <- data.table(wdf)
wide_imp1 <- apply(wdf, 2, impute.mean)
wide_imp2 <- WDT[, lapply(.SD, impute.mean)]
wide_apply <- function(x) apply(x, 2, impute.mean)
wide_DT <- function(x) x[, lapply(.SD, impute.mean)]
benchmark(wide_apply(wdf), wide_DT(WDT))
#            test replications elapsed relative user.self sys.self
# wide_apply(wdf)          100    7.84    1.413      7.84        0
#    wide_DT(WDT)          100    5.55    1.000      5.55        0
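One detail worth checking in the wide case: `apply()` coerces its input to a matrix, so its result is a matrix rather than a data frame. If a data frame is wanted back, a base R sketch along these lines may be more convenient. The data frame `small` is made up for illustration.

```r
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Hypothetical small wide data frame with scattered NAs
small <- data.frame(X1 = c(1, NA, 3), X2 = c(NA, 4, 6), X3 = c(7, 8, NA))

as_matrix <- apply(small, 2, impute.mean)  # apply() returns a matrix
as_df <- small
as_df[] <- lapply(small, impute.mean)      # `[] <-` keeps the data.frame class

class(as_matrix)
class(as_df)
as_df
```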

marbel
Using Tyler's example data from the comments above (this fills in the first column, `X1`):
x[is.na(x$X1), 1] <- mean(x$X1, na.rm = TRUE)
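The line above handles a single column. A sketch that repeats the same idea for every numeric column might look like the following. The data frame is built the way Tyler's comment suggests; the `set.seed()` call and the extra `id` column are additions for illustration and are not part of the original example.

```r
set.seed(1)  # assumption: seed added only to make the example repeatable
x <- data.frame(matrix(rnorm(30), ncol = 3))
x[c(4, 9), c(1, 2, 3)] <- NA
x$id <- letters[1:10]  # hypothetical non-numeric column, left untouched

# Fill each numeric column's NAs with that column's mean
for (col in names(x)) {
  if (is.numeric(x[[col]])) {
    x[is.na(x[[col]]), col] <- mean(x[[col]], na.rm = TRUE)
  }
}

colSums(is.na(x))  # count of remaining NAs per column
```

The `is.numeric()` guard is there because `mean()` on a character or factor column returns `NA` with a warning, so those columns are skipped.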