I would like to build a prediction model from a time series. I have a data frame with two columns, a date column (Date2) and Cases. The dates run monthly from 2008-01-01 to 2013-12-01, and Cases holds a count for each month. However, more than 30 of the 72 observations are NA. I want to predict Cases for the 3-4 months after 2013-12-01. Can anyone help me?

Here is the output of dput() for my data:

structure(list(Date2 = structure(c(13879, 13910, 13939, 13970, 
14000, 14031, 14061, 14092, 14123, 14153, 14184, 14214, 14245, 
14276, 14304, 14335, 14365, 14396, 14426, 14457, 14488, 14518, 
14549, 14579, 14610, 14641, 14669, 14700, 14730, 14761, 14791, 
14822, 14853, 14883, 14914, 14944, 14975, 15006, 15034, 15065, 
15095, 15126, 15156, 15187, 15218, 15248, 15279, 15309, 15340, 
15371, 15400, 15431, 15461, 15492, 15522, 15553, 15584, 15614, 
15645, 15675, 15706, 15737, 15765, 15796, 15826, 15857, 15887, 
15918, 15949, 15979, 16010, 16040), class = "Date"), Cases = c(16352L, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 10L, NA, 23L, 138L, NA, 18L, 
NA, 3534L, 43L, NA, 3L, 118L, NA, 172L, 4194L, NA, 9L, 2L, 162L, 
NA, 112L, 115L, NA, NA, 119L, NA, NA, 172L, NA, 25L, NA, NA, 
11L, 4L, 457L, 56L, NA, 148L, 446L, 30L, NA, NA, NA, NA, NA, 
NA, NA, 583L, NA, 180L, 193L, NA, 77L, NA, 18L, 15L, NA, NA, 
1L, NA, NA, NA)), .Names = c("Date2", "Cases"), row.names = c(NA, 
-72L), class = "data.frame")

Thank you in advance for your contribution.

Jake Burkhead
Ram
  • It's somewhat difficult to make predictions when over half of the historical data is not available. – Rich Scriven Feb 28 '14 at 03:14
  • I guess so. But if the dataset had no NA values, how would I create a prediction model? I will apply this model to another data frame that has the same columns (Cases and Date) but only a few NAs. – Ram Feb 28 '14 at 03:29
  • Check out the packages `forecast` and `astsa`. I'll see if I can create a general answer. – Rich Scriven Feb 28 '14 at 03:52
  • Thank you very much. I will check them out as soon as possible. – Ram Feb 28 '14 at 17:06

1 Answer

Maybe this can get you started, but making predictions is hard and requires understanding your data well. The information presented here isn't really enough to make good predictions IMO. This is a Poisson generalized linear model with Cases as a function of days since the first observation and month of the year. From eyeballing the data, the counts appear to vary by month and to decrease over the years.

library(ggplot2)

## `dats` is assumed to hold the data frame from the question's dput() output
dat <- dats[complete.cases(dats), ]   # drop the months with NA cases
origin <- dat$Date2[1]                # first observed date
## days since the first observation, stored as a plain number
dat$days <- as.numeric(dat$Date2 - origin)
## Poisson GLM: linear trend in days plus a month-of-year effect
mod2 <- glm(Cases ~ days + format(Date2, "%m"), data = dat, family = poisson())
dat$predicted <- "observed"

## See how the model performed against old data
dat <- rbind(dat, data.frame(
    Date2 = dat$Date2,
    Cases = unname(predict(mod2, type = "response")),
    predicted = "predicted",
    days = dat$days))

## predict future cases
futureDates <- seq(as.Date("2014/1/1"), by = "month", length.out = 12)
future <- data.frame(
    Date2 = futureDates,
    days = as.numeric(futureDates - origin))

datFuture <- rbind(dat, data.frame(Date2 = future$Date2,
                             days = future$days,
                             Cases = unname(predict(mod2, type = "response", newdata = future)),
                             predicted = "predicted"))

## points are the values, dashed lines connect them, solid lines are smoothers
ggplot(datFuture, aes(Date2, Cases, col = factor(predicted), group = predicted)) +
    geom_point(pch = 3) + ylab("Predicted Cases") + xlab("Date") +
    geom_line(lty = 2, lwd = 1.5, alpha = 0.2) +
    geom_smooth(alpha = 0.1, fill = NA)

Results look like this
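For the 3-4 month horizon asked about in the question, the same kind of model can be reduced to a small forecast table without any plotting. The following is only a sketch under stated assumptions: the data are retyped from the `dput()` output in the question, the month effect is stored in a helper column `month` (a name introduced here for illustration), and the horizon of four months is an arbitrary choice. With so many missing months the numbers should be read as rough, exploratory estimates.

```r
# Data retyped from the dput() output in the question (72 monthly values).
dats <- data.frame(
  Date2 = seq(as.Date("2008-01-01"), by = "month", length.out = 72),
  Cases = c(16352, rep(NA, 9), 10, NA, 23, 138, NA, 18, NA, 3534, 43, NA,
            3, 118, NA, 172, 4194, NA, 9, 2, 162, NA, 112, 115, NA, NA,
            119, NA, NA, 172, NA, 25, NA, NA, 11, 4, 457, 56, NA, 148,
            446, 30, rep(NA, 7), 583, NA, 180, 193, NA, 77, NA, 18, 15,
            NA, NA, 1, NA, NA, NA)
)

# Keep only the months with an observed count.
dat <- dats[complete.cases(dats), ]
origin <- dat$Date2[1]
dat$days <- as.numeric(dat$Date2 - origin)       # linear time trend
dat$month <- factor(format(dat$Date2, "%m"))     # month-of-year effect (hypothetical column name)

# Poisson GLM with a trend and a month effect, as in the answer above.
mod <- glm(Cases ~ days + month, data = dat, family = poisson())

# Build the next four months (assumed horizon) with the same predictors.
futureDates <- seq(as.Date("2014-01-01"), by = "month", length.out = 4)
future <- data.frame(
  Date2 = futureDates,
  days  = as.numeric(futureDates - origin),
  month = factor(format(futureDates, "%m"), levels = levels(dat$month))
)

# Predicted mean number of cases per month, on the original count scale.
future$Cases <- unname(predict(mod, newdata = future, type = "response"))
print(future[, c("Date2", "Cases")])
```

Changing `length.out` changes the horizon. The forecast months must be months that also occur in the fitted data, which holds here because every calendar month has at least one observed value.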

Rorschach
  • That's awesome, really appreciated. I want to ask some questions about the graph (they may be easy, but I need to understand it well). 1) There are two dashed lines in the graph; what exactly do they mean? 2) What do the observed and predicted lines mean, what is the difference between them, and how should I interpret them? 3) On my graph the y-axis shows numbers like 1e+05, 3e-05. How can I convert them to normal numbers as in your graph? Thank you so much. – Ram Feb 28 '14 at 17:03
  • @Ram the dashed lines just connect the observed and predicted points; they have no statistical meaning. The solid lines are loess curves (the default fit from `geom_smooth()`). Those numbers are in scientific notation; just change the [scaling](http://stackoverflow.com/questions/14563989/force-r-to-stop-plotting-abbreviated-axis-labels-e-g-1e00-in-ggplot2). All of the lines are just exploratory; look into glm models for the statistical analysis or ask around [here](http://stats.stackexchange.com/). – Rorschach Feb 28 '14 at 19:20
  • Just one last question. At the beginning of the solid lines, observed is about 14,000 and predicted is about 9,000. Where does this difference come from? I am asking because I assumed the observed line should match the actual data. For example, in the first month (2008-01-01) the number of cases is 16352, which is where the dashed line starts, yet the observed solid line is at about 14,000 there. I assumed these two points should be the same. Could you tell me why they differ? – Ram Feb 28 '14 at 20:55
  • @Ram the solid line is a loess curve fit to the observed data, consult `?geom_smooth()` for more information on specific types of fits available – Rorschach Feb 28 '14 at 22:48
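
The two points raised in the comments (a smoothed curve does not pass through the observed values, and axis labels can be shown as plain numbers) can be illustrated with base R alone. This is a sketch, not the exact computation behind the plot: `stats::loess()` with its defaults is used as a stand-in for the smoother that `geom_smooth()` picks for small data sets, and `plain_labels` is a hypothetical helper name introduced here.

```r
# Data retyped from the dput() output in the question (72 monthly values).
dats <- data.frame(
  Date2 = seq(as.Date("2008-01-01"), by = "month", length.out = 72),
  Cases = c(16352, rep(NA, 9), 10, NA, 23, 138, NA, 18, NA, 3534, 43, NA,
            3, 118, NA, 172, 4194, NA, 9, 2, 162, NA, 112, 115, NA, NA,
            119, NA, NA, 172, NA, 25, NA, NA, 11, 4, 457, 56, NA, 148,
            446, 30, rep(NA, 7), 583, NA, 180, 193, NA, 77, NA, 18, 15,
            NA, NA, 1, NA, NA, NA)
)
dat <- dats[complete.cases(dats), ]
dat$t <- as.numeric(dat$Date2)   # loess needs a numeric predictor

# A loess curve is a local regression through many neighbouring points,
# so its value at a given date is a smoothed estimate, not the raw count.
fit <- loess(Cases ~ t, data = dat)
first <- data.frame(
  observed = dat$Cases[1],
  smoothed = unname(fitted(fit)[1])
)
print(first)

# Hypothetical helper: format axis breaks as plain numbers instead of 1e+05.
plain_labels <- function(x) {
  format(x, big.mark = ",", scientific = FALSE, trim = TRUE)
}
print(plain_labels(c(1e5, 3e5)))

# With ggplot2 loaded, the helper can be passed to the y scale of a plot `p`:
#   p + scale_y_continuous(labels = plain_labels)
```

The first printed row shows the raw count for 2008-01-01 next to the smoothed value at the same date. They differ because the smoother averages over nearby months rather than interpolating each point.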