0

I have a dataframe which contains stock prices, something like:

id     date        price
1     2/3/2016     4
1     2/4/2016     4.2
1     2/5/2016     3
2     2/4/2016     3
2     2/5/2016     4.1
2     2/6/2016     3
.     .            . 
.     .            . 
.     .            . 

However, not all stocks cover the same time period, as companies leave and join the exchange. I'm looking to subset my data to a continuous time series (ignoring weekends). I face a tradeoff between number of stocks and length of time (i.e. the longer the the time series, the less companies have data available over the entire series).

I believe the optimal method for picking a series is that it contains a start date from a stock run and an end date of a (possibly different) stock run. Any other time period would not be maximizing the time coverage. In the end, I'm expecting a dataframe containing possible start dates, end dates, and number of companies. I can then decide which time range is best suited.

I hope this is clear and please let me know if any clarifications are needed.

Edit: This should help. Here is an crude dataframe

date1 <- seq(as.Date("2006/1/1"), as.Date("2010/1/1"), "days")
date2 <- seq(as.Date("2006/6/12"), as.Date("2008/1/1"), "days")
date3 <- seq(as.Date("2009/10/11"), as.Date("2010/12/1"), "days")
date4 <- seq(as.Date("2002/1/1"), as.Date("2007/1/13"), "days")

row_to_drop1 <- !(weekdays(as.Date(date1)) %in% c('Saturday','Sunday'))
row_to_drop2 <- !(weekdays(as.Date(date2)) %in% c('Saturday','Sunday'))
row_to_drop3 <- !(weekdays(as.Date(date3)) %in% c('Saturday','Sunday'))
row_to_drop4 <- !(weekdays(as.Date(date4)) %in% c('Saturday','Sunday'))

date1 <- date1[row_to_drop1]
date2 <- date2[row_to_drop2]
date3 <- date3[row_to_drop3]
date4 <- date4[row_to_drop4]

mydf <- rbind.data.frame(data.frame(id = 1, date = date1),data.frame(id=2,date=date2),data.frame(id=2,date=date3),data.frame(id=3,date=date4))

I'm hoping for this output:

start       end         #ofids
1/1/2006    1/1/2010    1
6/12/2006   1/1/2010    2
10/11/2009  1/1/2010    3
1/1/2002    1/1/2010    3
1/1/2006    1/1/2008    3
6/12/2006   1/1/2008    1
10/11/2009  1/1/2008    2
1/1/2002    1/1/2008    3
1/1/2006    12/1/2010   1
6/12/2006   12/1/2010   3
10/11/2009  12/1/2010   2
1/1/2002    12/1/2010   2
1/1/2006    1/13/2007   2
6/12/2006   1/13/2007   1
10/11/2009  1/13/2007   1
1/1/2002    1/13/2007   1

The #ofids is just made up, but hopefully it gets the point across. Some rows will be dropped as the start date can't be greater than the end date.

Here are the runs (combo of start dates with all end dates):

start       end
1/1/2006    1/1/2010
6/12/2006   1/1/2010
10/11/2009  1/1/2010
1/1/2002    1/1/2010
1/1/2006    1/1/2008
6/12/2006   1/1/2008
1/1/2002    1/1/2008
1/1/2006    12/1/2010
6/12/2006   12/1/2010
10/11/2009  12/1/2010
1/1/2002    12/1/2010
1/1/2006    1/13/2007
6/12/2006   1/13/2007
1/1/2002    1/13/2007

That should make for 14x3(ids)=42 rows.

Almacomet
  • 211
  • 2
  • 9
  • What's your desired output? Just the range of `date` for each `id`? That's just a simple aggregation. – alistaire Jan 18 '17 at 23:08
  • @alistaire All possible ranges and the number of id's available for that range. A possible range is a pair of start-end dates of a contiguous stretch for a stock (but not necessarily the same stock, I need to take the cartesian product of all start and end dates). – Almacomet Jan 18 '17 at 23:14
  • How are you defining "contiguous"? Days the market is open? That would require another dataset (or you'd have to infer it from what you have). You should edit your example to make it [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#5963610), including the desired result for the sample data. – alistaire Jan 18 '17 at 23:20
  • @alistaire The contiguous days are days the market is open, which is all days except weekends. I've edited my example, hopefully should be a lot clearer now. Thanks. – Almacomet Jan 18 '17 at 23:47
  • Much better, but most markets are closed for some holidays, as well. If you can generate a vector of days the market is open, you can check if each of those days is in `date` for each `id`, and then call `rle` on the result to see what the longest run is. – alistaire Jan 18 '17 at 23:54

1 Answers1

0

Here are two ways of summarizing your data. Firstly, you need to make a vector of when the market is open; here I'll just use weekdays, but real-life calendars tend to be more complicated. An easy way to get such a vector would be to pull data for an index you know has been continuously traded on that exchange during the period in question with quantmod or the like, and pull out the dates.

library(dplyr)    # Use another grammar if you prefer

# Make a vector of dates the market is open
mkt_open <- seq(min(mydf$date), max(mydf$date), '1 day')
mkt_open <- mkt_open[!(weekdays(mkt_open) %in% c('Saturday','Sunday'))]

Runs for each id

One way of calculating is to figure the runs for each id where it is trading or not. Using a full_join to insert NAs in a Boolean variable for whether a company is trading allows the use of data.table::rleid, which returns an ID (sequential integers) for each run. (You can do the same thing with base::rle, but it's more work.) Once grouped, summarizing the data usefully is easy.

mydf %>% mutate(trading = TRUE) %>%    # Add index column which will add NAs in join
    # Insert missing dates for each id
    full_join(crossing(id = unique(.$id), date = mkt_open)) %>% 
    mutate(trading = coalesce(trading, FALSE)) %>%    # Replace NAs with FALSE
    group_by(id) %>%
    arrange(date) %>%    # Sort by date before calculating runs
    group_by(run_id = data.table::rleid(trading), trading, add = TRUE) %>%    # Add ID to separate runs
    summarise(run_length = n(),    # Count rows per group
              start = min(date),
              end = max(date))

#> Source: local data frame [9 x 6]
#> Groups: id, run_id [?]
#> 
#>      id run_id trading run_length      start        end
#>   <dbl>  <int>   <lgl>      <int>     <date>     <date>
#> 1     1      1   FALSE       1044 2002-01-01 2005-12-30
#> 2     1      2    TRUE       1045 2006-01-02 2010-01-01
#> 3     1      3   FALSE        238 2010-01-04 2010-12-01
#> 4     2      1   FALSE       1159 2002-01-01 2006-06-09
#> 5     2      2    TRUE        407 2006-06-12 2008-01-01
#> 6     2      3   FALSE        463 2008-01-02 2009-10-09
#> 7     2      4    TRUE        298 2009-10-12 2010-12-01
#> 8     3      1    TRUE       1314 2002-01-01 2007-01-12
#> 9     3      2   FALSE       1013 2007-01-15 2010-12-01

If you only care about runs where a company is trading, tack on filter(trading) %>% select(-trading).


Runs by number of companies trading

If you're more concerned with the number of companies trading than the individual runs, you can aggregate the ids first:

mydf %>% group_by(date) %>% 
    summarise(no_ids = n_distinct(id)) %>%    # Count ids per date
    full_join(data.frame(date = mkt_open)) %>%    # Add any dates w/o trading
    mutate(no_ids = coalesce(no_ids, 0L)) %>%    # Fill NAs for days w/o trading with 0
    arrange(date) %>%
    group_by(run_id = data.table::rleid(no_ids), no_ids) %>% 
    summarise(run_length = n(), 
              start = min(date), 
              stop = max(date))

#> Source: local data frame [7 x 5]
#> Groups: run_id [?]
#> 
#>   run_id no_ids run_length      start       stop
#>    <int>  <int>      <int>     <date>     <date>
#> 1      1      1       1044 2002-01-01 2005-12-30
#> 2      2      2        115 2006-01-02 2006-06-09
#> 3      3      3        155 2006-06-12 2007-01-12
#> 4      4      2        252 2007-01-15 2008-01-01
#> 5      5      1        463 2008-01-02 2009-10-09
#> 6      6      2         60 2009-10-12 2010-01-01
#> 7      7      1        238 2010-01-04 2010-12-01
alistaire
  • 42,459
  • 4
  • 77
  • 117
  • I think I got it from here. I can take mydf$start and cross it with mydf$stop (after filtering trading == true). Then, I cross that with mydf and check that it's in the the range. Thanks! – Almacomet Jan 19 '17 at 03:44
  • Tack on `%>% ungroup() %>% filter(trading) %>% select(start, end) %>% unlist() %>% unique() %>% combn(2) %>% .[, .[1,] < .[2,]] %>% t() %>% as_data_frame() %>% setNames(c('start', 'end')) %>% mutate_all(as.Date) %>% arrange(start, end)` to the first one? I'm not sure what you're looking for for these ranges. Max companies active? Min? It may be more useful to make some visualizations so you can see what's going on. – alistaire Jan 19 '17 at 03:47