2

I am working with high frequency data in Time Series and I would like to get all the business days from my data. My data observations are separated by seconds, so there are 86400 seconds each day and my data set are spread over 31 days (so there are 2,678,400 observations!).

Here is (part) of my data:

In[1]: ts
Out[1]: 
2013-01-01 00:00:00    0.480928
2013-01-01 00:00:01    0.480928
2013-01-01 00:00:02    0.483977
2013-01-01 00:00:03    0.486725
2013-01-01 00:00:04    0.486725
...
2013-01-31 23:59:56    0.451630
2013-01-31 23:59:57    0.451630
2013-01-31 23:59:58    0.451630
2013-01-31 23:59:59    0.454683
Freq: S, Length: 2678400

What I would like to do is to create a new time series which consists of the business days from this month, but I would like to have them with their corresponding data seconds. For example, if 2013-01-02 (WED) until 2013-01-04 (Fri) are the first business days of the first week in January, then:

2013-01-02 00:00:00    0.507477
2013-01-02 00:00:01    0.501373
...
2013-01-03 00:00:00    0.489778
2013-01-03 00:00:01    0.489778
...
2013-01-04 23:59:58    0.598115
2013-01-04 23:59:59    0.598115
Freq: S, Length: 259200

so it will exclude of course all the data on Sat 2013-01-05 and 2013-01-06 since these are the weekend days. and so on...

I tried to use some pandas built-in commands, but couldn't find the right one since they aggregate by day without taking into consideration that each day contains sub columns in them. That is, for each second there is a value and they should not be averaged, just grouped together to a new series..

for example I tried:

  1. ts.asfreq(BDay()) --> finds the business day but averages over each day
  2. ts.resample() --> you must define 'how' (mean, max, min...)
  3. ts.groupby(lambda x : x.weekday) --> not either!
  4. ts = pd.Series(df, index = pd.bdate_range(start = '2013/01/01 00:00:00', end = '2013/01/31 23:59:59' , freq = 'S')) --> df since the original data is as DataFramem. Using pd.bdate_range didn't help since df and index must be in the same dimension..

I searched in pandas documentation, googled but could not find a clue...
Does anybody have an idea?

I would really appreciate your help!

Thanks!

p.s I would rather not use loops for that, since my data set is very large... (I have also other months to analyse)

Apython
  • 453
  • 6
  • 19

2 Answers2

3

Unfortunately this is a little slow, but should at least give the answer you are looking for.

#create an index of just the date portion of your index (this is the slow step)
ts_days = pd.to_datetime(ts.index.date)

#create a range of business days over that period
bdays = pd.bdate_range(start=ts.index[0].date(), end=ts.index[-1].date())

#Filter the series to just those days contained in the business day range.
ts = ts[ts_days.isin(bdays)]
chrisb
  • 49,833
  • 8
  • 70
  • 70
1

Modern pandas stores timestamps as numpy.datetime64 with a nanosecond time unit (one could check that by inspecting ts.index.values). It is much faster to convert both the original index and the one generated by bdate_range to a daily time unit ([D]) and to check the inclusion on these two arrays:

import numpy as np
import pandas

def _get_days_array(index):
    "Convert the index to a datetime64[D] array"
    return index.values.astype('<M8[D]')

def retain_business_days(ts):
    "Retain only the business days"
    tsdays = _get_days_array(ts.index) 
    bdays = _get_days_array(pandas.bdate_range(tsdays[0], tsdays[-1]))
    mask = np.in1d(tsdays, bdays)
    return ts[mask]
Ilya Kolpakov
  • 196
  • 2
  • 8