2

I am trying to impute missing value NA with interpolation by multiple groups.

I just subset a simple example:

Year    ST   CC   ID     MP     PS 
2002    15   3     3     NA    1.5
2003    15   3     3     NA    1.5
2004    15   3     3    193    1.5
2005    15   3     3    193    1.5
2006    15   3     3    348    1.5
2007    15   3     3    388    1.5
2008    15   3     3    388    1.5
1999    53   33    1     NA    3.4 
2000    53   33    1     NA    3.4        
2002    53   33    1     NA    2.9           
2003    53   33    1     NA    2.6           
2004    53   33    1     NA    2.6         
2005    53   33    1    170    3.8         
2006    53   33    1    170    3.0           
2007    53   33    1    330    4.2         
2008    53   33    1    330    5.0           

I used na.approx() but got wrong. It seems that my data has missing value in the first observation within each group.

setDT(dt)[, MP_interpolate := na.approx(MP, na.rm = T), .(Year, ST, CC, ID)]

setDT(dt)[, MP_interpolate := if(length(na.omit(MP))<2) MP else na.approx(MP, na.rm=TRUE), .(Year, ST, CC, ID)]

I also tried package imputeTS but not work.

Both are incorrect. Does it mean using interpolation to impute missing is not a good idea?

I'm not sure which imputation way is better (Do not want to use mean or median). I'm thinking about using PS trend to impute MP. (Just my thought, not question)

Peter Chen
  • 1,464
  • 3
  • 21
  • 48

2 Answers2

3

You could try imputeTS::na_kalman in an ave. This also extrapolates, what you probably want.

library(imputeTS)

dt$MP.imp <- with(dt, ave(MP, ST, CC, ID, FUN=na_kalman))
#    Year ST CC ID  MP  PS   MP.imp
# 1  2001 15  3  3  NA 1.5 193.0000
# 2  2002 15  3  3  NA 1.5 193.0000
# 3  2003 15  3  3  NA 1.5 193.0000
# 4  2004 15  3  3 193 1.5 193.0000
# 5  2005 15  3  3 193 1.5 193.0000
# 6  2006 15  3  3 348 1.5 348.0000
# 7  2007 15  3  3 388 1.5 388.0000
# 8  2008 15  3  3 388 1.5 388.0000
# 9  1999 53 33  1  NA 3.4 170.2034
# 10 2000 53 33  1  NA 3.4 166.3867
# 11 2002 53 33  1  NA 2.9 164.4496
# 12 2003 53 33  1  NA 2.6 165.0018
# 13 2004 53 33  1  NA 2.6 168.6527
# 14 2005 53 33  1 170 3.8 170.0000
# 15 2006 53 33  1 170 3.0 170.0000
# 16 2007 53 33  1 330 4.2 330.0000
# 17 2008 53 33  1 330 5.0 330.0000

You probably want to include year in the ave, it wouldn't have worked otherwise in your sample data.

jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • can you explain how `na_kalman` works. In `ID=1`, the `MP.imp` decreases and increases. It seems cool. – Peter Chen Mar 07 '20 at 15:44
  • @PeterChen It's based on space state models proposed by [Kalman 1960](http://www.sontaglab.org/kalman4.pdf). The Kalman filter applies a smoothing algorithm based on a level, a trend, and a seasonal component to inter/extrapolate missings, and also handles nonstationary data. It is thus different from linear interpolation. – jay.sf Mar 07 '20 at 15:59
  • Thank you. I thought it based on another column to impute at first, like impute `MP` based on the trend of `PS`... – Peter Chen Mar 07 '20 at 16:01
  • 1
    @PeterChen Actually it's just interpolating the grouped time series in an univariate manner. For multivariate interpolation you could make predictions, based e.g. on [`loess regression`](https://stackoverflow.com/a/27796497/6574038). Or use multiple imputation like [`mice`](https://cran.r-project.org/web/packages/mice/index.html) (parametric) or [`missForest`](https://cran.r-project.org/web/packages/missForest/index.html) (non-parametric). But this leads to a rather statistical discussion, which is better suited to [Cross Validated](https://stats.stackexchange.com/help/on-topic). – jay.sf Mar 07 '20 at 16:18
  • 1
    `loess regression` helps a lot. Since there are lots of `NA` in the column, it seems that I need to research multivariate interpolation. thanks – Peter Chen Mar 07 '20 at 19:54
0

If you have NA values at the beginning and end of the group, you should use na.rm = FALSE to get output of the same size as input. With na.rm = TRUE, the leading and trailing NAs are removed.

Also, you should not group by year because that would give you only one observation in a group.

library(data.table)

setDT(dt)
dt[, MP_interpolate := as.integer(zoo::na.approx(MP, na.rm = FALSE)), .(ST, CC, ID)]

data

dt <- structure(list(Year = c(2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 
2008L, 1999L, 2000L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 
2008L), ST = c(15L, 15L, 15L, 15L, 15L, 15L, 15L, 53L, 53L, 53L, 
53L, 53L, 53L, 53L, 53L, 53L), CC = c(3L, 3L, 3L, 3L, 3L, 3L, 
3L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L), ID = c(3L, 
3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
MP = c(NA, NA, 193L, 193L, 348L, 388L, 388L, NA, NA, NA, 
NA, NA, 170L, 170L, 330L, 330L), PS = c(1.5, 1.5, 1.5, 1.5, 
1.5, 1.5, 1.5, 3.4, 3.4, 2.9, 2.6, 2.6, 3.8, 3, 4.2, 5)), 
class = "data.frame", row.names = c(NA, -16L))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • I got `NA` or `TRUE` in the new column. – Peter Chen Mar 07 '20 at 05:44
  • It works for me though. I have also wrapped `as.integer` there so actually you should not be getting logical values. Can you check on the data which I have shared in the post? – Ronak Shah Mar 07 '20 at 05:54
  • Got warning message: `Warning messages: 1: In `[.data.table`(setDT(dt), , `:=`(MP_interpolate, as.integer(na.approx(MP, : Group 1 column 'MP_interpolate': 193 (type 'integer') at RHS position 3 taken as TRUE when assigning to type 'logical' 2: In `[.data.table`(setDT(dt), , `:=`(MP_interpolate, as.integer(na.approx(MP, : Group 2 column 'MP_interpolate': 170 (type 'integer') at RHS position 6 taken as TRUE when assigning to type 'logical'` – Peter Chen Mar 07 '20 at 05:56
  • Do you get this even on the data that I have shared? What is your `packageVersion('data.table')` ? I have `‘1.12.2’`. – Ronak Shah Mar 07 '20 at 05:59
  • `packageVersion('data.table'): ‘1.12.8’` and with your sample, I still got `NA` in the first observation within each group... `$ MP_interpolate: int NA NA 193 193 348 388 388 NA NA NA ...` – Peter Chen Mar 07 '20 at 15:41