Creating a time series from a dataset including missing values

Question

I need to create a time series from a data frame. The problem is that the observations are not well ordered. The data frame looks like this:

Cases  Date
15     1/2009
30     3/2010
45     12/2013

I have 60 observations like that. As you can see, the data was collected irregularly, starting in 1/2008 and ending in 12/2013 (many of the months between these years have no recorded cases). My assumption is that there were no cases in those months. So how can I convert this dataset to a time series? After that, I will try to predict the possible number of cases in the future.

No, I need to convert this data frame into a time series to build a prediction model for cases likely to occur in the future. The problem is that the months in the date column are not regularly spaced. (At least, I suppose that this will be a problem.) — Ram, Feb 26 '14 at 02:26
The Date column is a date. But, for example, 2/2009 appears twice. Also, should I have NA values for months that have no cases in order to build a predictive model? And can you suggest a resource on building a predictive model? — Ram, Feb 26 '14 at 02:35
Since you have just 60 data points, why don't you include them here? — Jd Baba, Feb 26 '14 at 02:40
Yes, exactly. Then I want to build a prediction model for the next 2-3 months. — Ram, Feb 26 '14 at 02:41
I am not able to answer my own question because of my reputation :( . When I tried to type it here, the data got too messy. Do you have any idea how I can share my data here? — Ram, Feb 26 '14 at 02:51
structure(list(Cases = c(15L, 15L, 30L, 11L, 20L, 90L, 15L, 56L, 323L, 107L, 12L, 38L, 48L, 95L, 240L, 43L, 115L, 142L, 4185L, 105L, 16352L, 172L, 119L, 148L, 131L, 150L, 193L, 10L, 23L, 46L, 66L, 26L, 9L, 12L, 112L, 43L, 61L, 119L, 47L, 35L, 10L, 4L, 30L, 196L, 3L, 9L, 29L, 12L, 9L, 3L, 20L, 1L, 57L, 3502L, 1L, 9L, 1L, 15L, 50L), — Ram, Feb 26 '14 at 03:06
Date2 = structure(c(15857, 15826, 15675, 15156, 14396, 15006, 15065, 15248, 15218, 15614, 15614, 15614, 15614, 15614, 15614, 15614, 14822, 14730, 14610, 15218, 13879, 14579, 15340, 15309, 15340, 15675, 15706, 14184, 14245, 14276, 14276, 14276, 14335, 14396, 14791, 14426, 14518, 14914, 15006, 15006, 15065, 15187, 15371, 15340, 15826, 14610, 15218, 15765, 14335, 14488, 14730, 15949, 14518, 14396, 14700, 14669, 14700, 15765, 15765), class = "Date")), — Ram, Feb 26 '14 at 03:06
.Names = c("Cases", "Date2"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 54L, 55L, 56L, 57L, 58L, 59L, 60L), class = "data.frame", na.action = structure(53L, .Names = "53", class = "omit")) — Ram, Feb 26 '14 at 03:06
Please put the data in the question instead. And for next time, read about how to produce a [**minimal, reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) — Henrik, Feb 26 '14 at 07:29

Rorschach · Accepted Answer · 2014-02-26T03:21:01.227

0

Try installing the plyr package:

install.packages("plyr")

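and then sum the Cases of rows that share the same Date2:

library(plyr)
# dat is the data frame from the question.
# Split it by Date2 and sum Cases within each group (one row per date).
mergedData <- ddply(dat, .(Date2), .fun = function(x) {
    data.frame(Cases = sum(x$Cases))
})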

> head(mergedData)
       Date2 Cases
1 2008-01-01 16352
2 2008-11-01    10
3 2009-01-01    23
4 2009-02-01   138
5 2009-04-01    18
6 2009-06-01  3534
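
If installing plyr is not an option, the same per-date sum can probably be done in base R. The following is only a sketch on made-up toy data; the values are hypothetical and only the column names (Cases, Date2) come from the question.

```r
# Toy data in the same shape as the question's data frame (values are made up).
dat <- data.frame(
  Cases = c(15L, 30L, 45L, 10L),
  Date2 = as.Date(c("2009-02-01", "2009-02-01", "2010-03-01", "2013-12-01"))
)

# Sum Cases within each distinct Date2; the result has one row per date.
mergedData <- aggregate(Cases ~ Date2, data = dat, FUN = sum)
print(mergedData)
```

The two rows for 2009-02-01 collapse into a single row whose Cases value is their sum.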

edited Feb 26 '14 at 03:21

answered Feb 26 '14 at 02:16

Rorschach


Now I have no repeated dates after applying the code. I also converted my Date2 column to a regularly spaced one (from 2008-01-01 to 2013-12-31, in one-month steps), so I have some NA values. How can I create a time series from this data frame, which includes NA values, using the tm package (or another package)? I hope I explained my question clearly. – Ram Feb 26 '14 at 13:56
OK, I understand, and I plotted this data. I have one last question, if you don't mind. My plot's x-axis is the date (ending at 2013-12-31) and the y-axis is Cases. How can I predict the number of cases in January 2014? I thought the "lm" function doesn't work for time series prediction. – Ram Feb 27 '14 at 00:27
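
On the last comment about predicting January 2014: one option that ships with base R is simple exponential smoothing via HoltWinters() from the stats package. This is a rough sketch on simulated counts, not a recommendation for this dataset. The series below is made up, and the real data contains a few very large values that any model would need to handle with care.

```r
# Made-up monthly counts for 2008-2013; replace with the real monthly series.
set.seed(1)
cases_ts <- ts(rpois(72, lambda = 20), start = c(2008, 1), frequency = 12)

# Simple exponential smoothing: no trend (beta) and no seasonal (gamma) component.
fit <- HoltWinters(cases_ts, beta = FALSE, gamma = FALSE)

# Point forecasts for the three months after the series ends (Jan-Mar 2014).
fc <- predict(fit, n.ahead = 3)
print(fc)
```

Because the model has no trend or seasonal term, all three forecasts are the same smoothed level. Whether that is adequate depends on the real series.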

score 0 · Answer 2 · answered Feb 26 '14 at 02:52


You can create a separate, complete sequence of dates and merge it with your data. This produces a complete series in which the missing months are NA. If df is your data frame and Date is its date column, create the full sequence ts and merge as below.

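# Full monthly sequence covering the study period
ts <- data.frame(Date = seq(as.Date("2008-01-01"), as.Date("2013-12-31"), by = "1 month"))
# Keep every month; months absent from df get NA
dfwithmisisng <- merge(ts, df, by = "Date", all = TRUE)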
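
To get from the merged data frame to an actual time series object, a minimal sketch might look like the following. It uses made-up values and assumes the dates have already been de-duplicated to one row per month. The zero-fill step follows the question's own assumption that a month without a report had no cases. The names grid, full and cases_ts are illustrative only.

```r
# Toy monthly counts with gaps (made-up values, one row per month).
df <- data.frame(
  Date  = as.Date(c("2008-01-01", "2008-03-01", "2013-12-01")),
  Cases = c(5L, 7L, 2L)
)

# Full monthly grid over the study period.
grid <- data.frame(Date = seq(as.Date("2008-01-01"), as.Date("2013-12-01"), by = "month"))

# Left join onto the grid: months with no report get NA.
full <- merge(grid, df, by = "Date", all.x = TRUE)

# Assumption taken from the question: a month with no report had zero cases.
full$Cases[is.na(full$Cases)] <- 0L

# Monthly ts object starting in January 2008.
cases_ts <- ts(full$Cases, start = c(2008, 1), frequency = 12)
print(cases_ts)
```

If the missing months should stay unknown rather than zero, skip the zero-fill line. ts() accepts NA values, although many modelling functions do not.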

answered Feb 26 '14 at 02:52

Cirrus


I got this error: Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column – Ram Feb 26 '14 at 03:10
@Ram you should first define a date column in your data and name it Date. There must be a common column to merge on. – Cirrus Feb 26 '14 at 03:19
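
The error above is what merge() reports when the column named in by does not exist in one of the data frames. Here the question's data uses Date2 rather than Date. Two ways around it are sketched below on made-up values; the names months, merged1 and merged2 are illustrative.

```r
# Toy data using the question's column names (values are made up).
dat <- data.frame(
  Cases = c(15L, 45L),
  Date2 = as.Date(c("2009-01-01", "2010-03-01"))
)
months <- data.frame(Date = seq(as.Date("2009-01-01"), as.Date("2010-03-01"), by = "month"))

# Option 1: tell merge() which column to use on each side.
merged1 <- merge(months, dat, by.x = "Date", by.y = "Date2", all.x = TRUE)

# Option 2: rename the column first, then merge on the shared name.
names(dat)[names(dat) == "Date2"] <- "Date"
merged2 <- merge(months, dat, by = "Date", all.x = TRUE)
```

Both options should give the same result: one row per month, with NA in Cases wherever dat had no matching date.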
