0

I am using Python dask to process a large csv panel data set (10GB+). The dataframe looks like

 stock     date     time   spread  time_diff 
  VOD      01-01    9:05    0.01     0:07     
  VOD      01-01    9:12    0.03     0:52     
  VOD      01-01   10:04    0.02     0:11
  VOD      01-01   10:15    0.01     0:10     
  VOD      01-01   10:25    0.03     0:36      
  VOD      01-01   11:01    0.02     0:03
  VOD      01-01   10:04    0.02     0:09  
  VOD      01-01   10:15    0.01     0:10   
  VOD      01-01   10:25    0.03     0:39  
  VOD      01-01   11:04    0.02    22:00 
  VOD      01-02    9:04    0.02     0:05
  ...       ...     ...     ....     ...
  BAT      01-01    13:05   0.04    10:02
  BAT      01-02    9:07    0.05     0:03
  BAT      01-02    9:10    0.06     0:04

The column time_diff is time difference between two observations by using code: df['time_diff']=df['time'].shift(-1)-df['time']

I want to calculate the time-weighted spread for each stock in each day, but I got problem in deleting/excluding the last observation in each stock-day. i.e. I want to delete/exclude those observations

 stock     date     time   spread  time_diff 
  VOD      01-01    11:04   0.02    22:00 
  BAT      01-01    13:05   0.04    10:02
  ...       ...     ...     ....     ...

because their time difference include the next day's variable.

The question here is how to delete/exclude the last observations for each stock-day in pandas or dask dataframe. Thank you.

rpanai
  • 12,515
  • 2
  • 42
  • 64
FlyUFalcon
  • 314
  • 1
  • 4
  • 18

1 Answers1

1

First group by on columns to get 'per stock per day' data. Then, use tail with negative indexing to exclude last row.

df = df.groupby(['stock','date']).apply(lambda x: x.tail(-1))
  • This looks like [this](https://stackoverflow.com/a/55907156/4819376) answer and doesn't seem to work. – rpanai May 02 '19 at 12:38
  • hi @Aarsh Trivedi. Thanks for this. I got a warning UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected. Before: .apply(func) After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result or: .apply(func, meta=('x', 'f8')) for series result – FlyUFalcon May 02 '19 at 13:05
  • I think its because the dataset is dask dataframe instead of pandas dataframe. – FlyUFalcon May 02 '19 at 13:06
  • hi @rpanai, the answer does works to me as negative indexing has been implemented in pandas in recent versions. – Aarsh Trivedi May 03 '19 at 06:44
  • @FlyUFalcon, are you able to get desired output besides the warning? – Aarsh Trivedi May 03 '19 at 06:45
  • No. I think its because I am use Dask instead of Pandas dataframe. Dask has wired groupby function. – FlyUFalcon May 03 '19 at 08:40
  • @AarshTrivedi It doesn't work for me neither in `pandas` or `dask` do you mind to create a reproducible example? – rpanai May 03 '19 at 12:07