I have a complex problem. I have missing values I need to impute, but the imputation needs to be by groups in my data frame (since they were collected at different times) and I don't want to do it by mean values. I need to use the previously recorded value if it was not recorded for a particular observation.
Here is some sample data:
sample_data <- data.frame(Class = rep(x= letters[1:10], each=100),
group= rep(x=c("inside", "outside"), each=50),
Sample_number = seq(1,50,by=1),
x1= rnorm(1000,mean=0, sd=.5),
x2= 0)
sample_data$Class_group <- paste0(sample_data$Class,"_", sample_data$group)
sample_data$Class_group <- as.factor(sample_data$Class_group)
sample_data$x1[sample_data$x1 < 0] <- NA
and here is my futile attempt:
library(data.table)
varieties=levels(sample_data$Class_group)
for (i in 1:length(levels(sample_data$Class_group))){
variety_subset <- subset(sample_data, sample_data$Class_group==varieties[i])
for (ii in 1:nrow(variety_subset)){
temp_df <- subset(variety_subset, variety_subset$Sample_number==ii)
if(is.number(temp_df$x1)){
variety_subset$x2 <- variety_subset$x1
} else {
variety_subset[ , x2 := shift(x2, n=1L, type="lag")]
}}}
I am unsure how to proceed. I think a rolling join would be the good option where selecting the value of x1 that is nearest to sample_number to impute in x2, but I don't have two dataframes I only have one.
I really need to do things row wise similar to a dplyr mutate
like this question, but because of my groups that I need the imputed variables to come from the groups themselves. Which is why I am trying the for
loops.
There has to be a more elegant way to do this!?
NOTE: I could do it in MS Excel though it would take a lot of double checking, but it would be easy. I'd put a formula in cell B2 Something like =IF(ISNUMBER(A2), A2, B1)
, the first value in my real dataset always has a recorded value so this works, but I really need to do it reproducibly in R according to my groups which are Class_group.
Here is how I'd like it to work... before:
Class,group,Sample_number,x1,x2,Class_group
a inside 1 NA 0 a_inside
a inside 2 NA 0 a_inside
a inside 3 NA 0 a_inside
a inside 4 NA 0 a_inside
a inside 5 0.57 0 a_inside
a inside 6 NA 0 a_inside
a inside 7 NA 0 a_inside
a inside 8 NA 0 a_inside
a inside 9 0.43 0 a_inside
a inside 10 0.19 0 a_inside
a inside 11 0.09 0 a_inside
a inside 12 0.13 0 a_inside
a inside 13 0.68 0 a_inside
a inside 14 0.50 0 a_inside
a inside 15 0.57 0 a_inside
and after...
Class,group,Sample_number,x1,x2,Class_group
a inside 1 NA 0.57 a_inside
a inside 2 NA 0.57 a_inside
a inside 3 NA 0.57 a_inside
a inside 4 NA 0.57 a_inside
a inside 5 0.57 0.57 a_inside
a inside 6 NA 0.57 a_inside
a inside 7 NA 0.57 a_inside
a inside 8 NA 0.43 a_inside
a inside 9 0.43 0.43 a_inside
a inside 10 0.19 0.19 a_inside
a inside 11 0.09 0.09 a_inside
a inside 12 0.13 0.13 a_inside
a inside 13 0.68 0.68 a_inside
a inside 14 0.50 0.50 a_inside
a inside 15 0.57 0.57 a_inside