I'm trying to fill in missing data in R. The data is a single variable with a date.

I'm using the imputeTS package, but when I plot the output I can tell the imputed values are off. In Excel, when I use a straight-line calculation, the results appear to be closer. I want to avoid this, as I'd be over-reliant on too few data points.

library("imputeTS")

org <- read.csv(file.choose(), header = TRUE)

m_default <- na_kalman(org)
m_auto <- na_kalman(org, model = "auto.arima")
m_struct <- na_kalman(org, model = "StructTS", smooth = TRUE)
m_trends <- na_kalman(org, model = "StructTS", smooth = TRUE, type = "trend")
m_ip <- na_interpolation(org, option = "linear")

[Graphed Results][1]

[1]: https://i.stack.imgur.com/ozHDD.jpg

In the image, you can see that the Excel estimates are closer to the line than the R estimates.

Below is the data I used as the input.

Thank you

42131 14897320
42161 15309884
42185 na
42191 15736110
42221 16193078
42251 16660808
42277 na
42281 17169827
42311 17710224
42341 18293716
42369 na
42371 18891824
42401 19525236
42431 20202090
42460 na
42461 20913242
42491 21668513
42551 23271395
42575 23918755
42605 24700462
42635 25513112
42643 na
42665 26363177
42695 27247927
42725 28182277
42735 na
42755 29116689
42785 30102583
42809 30962403
42815 31156665
42823 31464561
42825 na
42853 32565105
42883 33710529
42913 34908319
42916 na
42943 36166021
42973 37466067
43003 38813763
43008 na
43033 40247438
43055 41326456
43056 41416270
43063 41741074
43085 42881998
43089 43121038
43100 na
43115 44419898
Steffen Moritz
Donal B

2 Answers

Is the first variable the date? If so, your time series seems to be irregularly spaced (also called unevenly spaced). imputeTS assumes the input is a regularly spaced time series, which is probably why the results are not as expected. One solution is to make the time series evenly spaced by adding the missing timestamps with NA observations, and then to perform the imputation with imputeTS.
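
The regularization step described above could be sketched roughly as follows. This is only an illustration under some assumptions: the column names `day` and `value` are made up, the first column is taken to be a day serial so a daily grid makes sense, and only the first few rows of the question's data are used. The imputeTS call is guarded so the grid-building part runs even without the package installed.

```r
# First rows of the question's data (column names are hypothetical).
org <- data.frame(
  day   = c(42131, 42161, 42185, 42191, 42221),
  value = c(14897320, 15309884, NA, 15736110, 16193078)
)

# Build a regular daily grid covering the observed range (assumed spacing: 1 day).
grid <- data.frame(day = seq(min(org$day), max(org$day), by = 1))

# Left-join the observations onto the grid; days without an observation become NA.
regular <- merge(grid, org, by = "day", all.x = TRUE)

# Impute on the evenly spaced series, then keep only the original timestamps.
if (requireNamespace("imputeTS", quietly = TRUE)) {
  regular$filled <- imputeTS::na_interpolation(regular$value, option = "linear")
  result <- regular[regular$day %in% org$day, ]
  print(result)
}
```

The same `regular$value` vector could be passed to `na_kalman()` instead, although with this many inserted NA values the choice of method will matter a lot.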

Steffen Moritz

If I understood correctly, you want to fill in the gaps and see what the general trend is. If this is the case, I personally recommend using stats::approx(), as in the following.

> a <- c(1,2,NA,5, NA, NA, 7) # this would be org[,2]
> stats::approx(a, method = 'linear', n = 7)
$x
[1] 1 2 3 4 5 6 7

$y
[1] 1.000000 2.000000 3.500000 5.000000 5.666667 6.333333 7.000000
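
Note that the call above interpolates over the vector index, so it treats the points as evenly spaced. For data with uneven dates like in the question, a variant that passes the date column as `x` might be closer to a straight-line calculation between neighbouring dates. The following is a minimal sketch on the first few rows of the question's data; the names `day`, `value`, `known` and `filled` are just for illustration.

```r
# First rows of the question's data: day serials and values, with one gap.
day   <- c(42131, 42161, 42185, 42191, 42221)
value <- c(14897320, 15309884, NA, 15736110, 16193078)

# Interpolate only from the known points, but evaluate at every date,
# so the actual distance between dates is taken into account.
known  <- !is.na(value)
filled <- stats::approx(x = day[known], y = value[known],
                        xout = day, method = "linear")$y

print(data.frame(day, value, filled))
```

Here the gap at day 42185 is filled on the line between days 42161 and 42191, weighted by how far 42185 lies between them, rather than halfway as index-based interpolation would give.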
Garini