I am using Python dask
to process a large csv panel data set (10GB+).
The dataframe looks like
stock date time spread time_diff
VOD 01-01 9:05 0.01 0:07
VOD 01-01 9:12 0.03 0:52
VOD 01-01 10:04 0.02 0:11
VOD 01-01 10:15 0.01 0:10
VOD 01-01 10:25 0.03 0:36
VOD 01-01 11:01 0.02 0:03
VOD 01-01 10:04 0.02 0:09
VOD 01-01 10:15 0.01 0:10
VOD 01-01 10:25 0.03 0:39
VOD 01-01 11:04 0.02 22:00
VOD 01-02 9:04 0.02 0:05
... ... ... .... ...
BAT 01-01 13:05 0.04 10:02
BAT 01-02 9:07 0.05 0:03
BAT 01-02 9:10 0.06 0:04
The column time_diff is time difference between two observations by using code:
df['time_diff']=df['time'].shift(-1)-df['time']
I want to calculate the time-weighted spread for each stock in each day, but I got problem in deleting/excluding the last observation in each stock-day. i.e. I want to delete/exclude those observations
stock date time spread time_diff
VOD 01-01 11:04 0.02 22:00
BAT 01-01 13:05 0.04 10:02
... ... ... .... ...
because their time difference include the next day's variable.
The question here is how to delete/exclude the last observations for each stock-day in pandas
or dask
dataframe. Thank you.