I have a dataframe which contains stock prices, something like:
id date price
1 2/3/2016 4
1 2/4/2016 4.2
1 2/5/2016 3
2 2/4/2016 3
2 2/5/2016 4.1
2 2/6/2016 3
. . .
. . .
. . .
However, not all stocks cover the same time period, as companies leave and join the exchange. I'm looking to subset my data to a continuous time series (ignoring weekends). I face a tradeoff between number of stocks and length of time (i.e. the longer the the time series, the less companies have data available over the entire series).
I believe the optimal method for picking a series is that it contains a start date from a stock run and an end date of a (possibly different) stock run. Any other time period would not be maximizing the time coverage. In the end, I'm expecting a dataframe containing possible start dates, end dates, and number of companies. I can then decide which time range is best suited.
I hope this is clear and please let me know if any clarifications are needed.
Edit: This should help. Here is an crude dataframe
date1 <- seq(as.Date("2006/1/1"), as.Date("2010/1/1"), "days")
date2 <- seq(as.Date("2006/6/12"), as.Date("2008/1/1"), "days")
date3 <- seq(as.Date("2009/10/11"), as.Date("2010/12/1"), "days")
date4 <- seq(as.Date("2002/1/1"), as.Date("2007/1/13"), "days")
row_to_drop1 <- !(weekdays(as.Date(date1)) %in% c('Saturday','Sunday'))
row_to_drop2 <- !(weekdays(as.Date(date2)) %in% c('Saturday','Sunday'))
row_to_drop3 <- !(weekdays(as.Date(date3)) %in% c('Saturday','Sunday'))
row_to_drop4 <- !(weekdays(as.Date(date4)) %in% c('Saturday','Sunday'))
date1 <- date1[row_to_drop1]
date2 <- date2[row_to_drop2]
date3 <- date3[row_to_drop3]
date4 <- date4[row_to_drop4]
mydf <- rbind.data.frame(data.frame(id = 1, date = date1),data.frame(id=2,date=date2),data.frame(id=2,date=date3),data.frame(id=3,date=date4))
I'm hoping for this output:
start end #ofids
1/1/2006 1/1/2010 1
6/12/2006 1/1/2010 2
10/11/2009 1/1/2010 3
1/1/2002 1/1/2010 3
1/1/2006 1/1/2008 3
6/12/2006 1/1/2008 1
10/11/2009 1/1/2008 2
1/1/2002 1/1/2008 3
1/1/2006 12/1/2010 1
6/12/2006 12/1/2010 3
10/11/2009 12/1/2010 2
1/1/2002 12/1/2010 2
1/1/2006 1/13/2007 2
6/12/2006 1/13/2007 1
10/11/2009 1/13/2007 1
1/1/2002 1/13/2007 1
The #ofids is just made up, but hopefully it gets the point across. Some rows will be dropped as the start date can't be greater than the end date.
Here are the runs (combo of start dates with all end dates):
start end
1/1/2006 1/1/2010
6/12/2006 1/1/2010
10/11/2009 1/1/2010
1/1/2002 1/1/2010
1/1/2006 1/1/2008
6/12/2006 1/1/2008
1/1/2002 1/1/2008
1/1/2006 12/1/2010
6/12/2006 12/1/2010
10/11/2009 12/1/2010
1/1/2002 12/1/2010
1/1/2006 1/13/2007
6/12/2006 1/13/2007
1/1/2002 1/13/2007
That should make for 14x3(ids)=42 rows.