1

I have a complex problem. I have missing values I need to impute, but the imputation needs to be by groups in my data frame (since they were collected at different times) and I don't want to do it by mean values. I need to use the previously recorded value if it was not recorded for a particular observation.

Here is some sample data:

sample_data <- data.frame(Class = rep(x= letters[1:10], each=100),
                          group= rep(x=c("inside", "outside"), each=50),
                          Sample_number = seq(1,50,by=1),
                            x1= rnorm(1000,mean=0, sd=.5), 
                            x2= 0)
sample_data$Class_group <- paste0(sample_data$Class,"_", sample_data$group)
sample_data$Class_group <- as.factor(sample_data$Class_group)
sample_data$x1[sample_data$x1 < 0] <- NA

and here is my futile attempt:

library(data.table)

varieties=levels(sample_data$Class_group)


for (i in 1:length(levels(sample_data$Class_group))){

  variety_subset <- subset(sample_data, sample_data$Class_group==varieties[i])

  for (ii in 1:nrow(variety_subset)){
    temp_df <- subset(variety_subset, variety_subset$Sample_number==ii)
    if(is.number(temp_df$x1)){
      variety_subset$x2 <- variety_subset$x1
    } else {
      variety_subset[ , x2 := shift(x2, n=1L, type="lag")]
    }}}

I am unsure how to proceed. I think a rolling join would be the good option where selecting the value of x1 that is nearest to sample_number to impute in x2, but I don't have two dataframes I only have one.

I really need to do things row wise similar to a dplyr mutate like this question, but because of my groups that I need the imputed variables to come from the groups themselves. Which is why I am trying the for loops.

There has to be a more elegant way to do this!?

NOTE: I could do it in MS Excel though it would take a lot of double checking, but it would be easy. I'd put a formula in cell B2 Something like =IF(ISNUMBER(A2), A2, B1), the first value in my real dataset always has a recorded value so this works, but I really need to do it reproducibly in R according to my groups which are Class_group.

Here is how I'd like it to work... before:

Class,group,Sample_number,x1,x2,Class_group
a   inside  1   NA     0    a_inside
a   inside  2   NA     0    a_inside
a   inside  3   NA     0    a_inside
a   inside  4   NA     0    a_inside
a   inside  5   0.57   0    a_inside
a   inside  6   NA     0    a_inside
a   inside  7   NA     0    a_inside
a   inside  8   NA     0    a_inside
a   inside  9   0.43    0   a_inside
a   inside  10  0.19    0   a_inside
a   inside  11  0.09    0   a_inside
a   inside  12  0.13    0   a_inside
a   inside  13  0.68    0   a_inside
a   inside  14  0.50    0   a_inside
a   inside  15  0.57    0   a_inside

and after...

Class,group,Sample_number,x1,x2,Class_group
a   inside  1   NA      0.57    a_inside
a   inside  2   NA      0.57    a_inside
a   inside  3   NA      0.57    a_inside
a   inside  4   NA      0.57    a_inside
a   inside  5   0.57    0.57    a_inside
a   inside  6   NA      0.57    a_inside
a   inside  7   NA      0.57    a_inside
a   inside  8   NA      0.43    a_inside
a   inside  9   0.43    0.43    a_inside
a   inside  10  0.19    0.19    a_inside
a   inside  11  0.09    0.09    a_inside
a   inside  12  0.13    0.13    a_inside
a   inside  13  0.68    0.68    a_inside
a   inside  14  0.50    0.50    a_inside
a   inside  15  0.57    0.57    a_inside
andemexoax
  • 323
  • 3
  • 15
  • Can you give example desired output for 10 or so rows? What are you grouping values exactly- just `Class_group`? – Luke C Jul 10 '18 at 21:34
  • @LukeC, I added example output. Again in my real dataset Sample number 1 always has a recorded value for x1, but with my sample data it is not the case. – andemexoax Jul 10 '18 at 21:45
  • For row 8, why is `x2` 0.43 and not 0.57? Is it because that row is closer to row 9 which is not `NA`? – Ameya Jul 10 '18 at 22:06
  • @Ameya Yes that is why I put it that way which would be similar to the rolling join 'nearest' method, but it would also work for me if it were the other way too, to have it at 0.57. – andemexoax Jul 10 '18 at 22:25
  • `sample_data%>%group_by(Class,group)%>%fill(x1)%>%fill(x1,.direction = "up")` – Onyambu Jul 10 '18 at 23:06

2 Answers2

3

You may need ?zoo::na.locf.

library(zoo)
library(data.table)
set.seed(1) # Use for reproducibility
sample_data <- data.frame(Class = rep(x= letters[1:10], each=100),
                          group= rep(x=c("inside", "outside"), each=50),
                          Sample_number = seq(1,50,by=1),
                          x1= rnorm(1000,mean=0, sd=.5), 
                          x2= 0)
sample_data$Class_group <- paste0(sample_data$Class,"_", sample_data$group)
sample_data$Class_group <- as.factor(sample_data$Class_group)
sample_data$x1[sample_data$x1 < 0] <- NA
varieties=levels(sample_data$Class_group)
setDT(sample_data)
sample_data[
  , x2:=ifelse(
    is.na(na.locf(x1, na.rm = FALSE)), 
    na.locf(x1, na.rm = FALSE, fromLast = TRUE), 
    na.locf(x1, na.rm = FALSE)
  ), 
  by = 'Class_group'
]
> sample_data[Class_group == 'a_inside'][1:10]
    Class  group Sample_number         x1         x2 Class_group
 1:     a inside             1         NA 0.09182166    a_inside
 2:     a inside             2 0.09182166 0.09182166    a_inside
 3:     a inside             3         NA 0.09182166    a_inside
 4:     a inside             4 0.79764040 0.79764040    a_inside
 5:     a inside             5 0.16475389 0.16475389    a_inside
 6:     a inside             6         NA 0.16475389    a_inside
 7:     a inside             7 0.24371453 0.24371453    a_inside
 8:     a inside             8 0.36916235 0.36916235    a_inside
 9:     a inside             9 0.28789068 0.28789068    a_inside
10:     a inside            10         NA 0.28789068    a_inside

As an aside, there is no need to create the intermediate variable Class_group. You can use by = c('Class', 'group') to achieve the same results.

Ameya
  • 1,712
  • 1
  • 14
  • 29
  • Works great! It doesn't work as I think 'nearest' might but uses the last observation carried forward which is one of the solutions I was aiming for. Thanks! – andemexoax Jul 11 '18 at 00:05
2

Using roll="nearest" to ensure cases with consecutive NAs is closer to OP's request:

sample_data[, x2 := sample_data[!is.na(x1)][
    sample_data, x1, on=.(Class_group, Sample_number), roll="nearest"]][]

output:

    Class  group Sample_number   x1   x2 Class_group
 1:     a inside             1   NA 0.57    a_inside
 2:     a inside             2   NA 0.57    a_inside
 3:     a inside             3   NA 0.57    a_inside
 4:     a inside             4   NA 0.57    a_inside
 5:     a inside             5 0.57 0.57    a_inside
 6:     a inside             6   NA 0.57    a_inside
 7:     a inside             7   NA 0.57    a_inside
 8:     a inside             8   NA 0.43    a_inside
 9:     a inside             9 0.43 0.43    a_inside
10:     a inside            10 0.19 0.19    a_inside
11:     a inside            11 0.09 0.09    a_inside
12:     a inside            12 0.13 0.13    a_inside
13:     a inside            13 0.68 0.68    a_inside
14:     a inside            14 0.50 0.50    a_inside
15:     a inside            15 0.57 0.57    a_inside

Sample data:

library(data.table)
sample_data <- fread("Class    group    Sample_number    x1    x2    Class_group
a   inside  1   NA     0    a_inside
a   inside  2   NA     0    a_inside
a   inside  3   NA     0    a_inside
a   inside  4   NA     0    a_inside
a   inside  5   0.57   0    a_inside
a   inside  6   NA     0    a_inside
a   inside  7   NA     0    a_inside
a   inside  8   NA     0    a_inside
a   inside  9   0.43    0   a_inside
a   inside  10  0.19    0   a_inside
a   inside  11  0.09    0   a_inside
a   inside  12  0.13    0   a_inside
a   inside  13  0.68    0   a_inside
a   inside  14  0.50    0   a_inside
a   inside  15  0.57    0   a_inside")
sample_data[, x2 := as.numeric(x2)]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35