 2012-10-08 07:12:22            0.0    0          0  2315.6    0     0.0    0
 2012-10-08 09:14:00         2306.4   20  326586240  2306.4  472  2306.8    4
 2012-10-08 09:15:00         2306.8   34  249805440  2306.8  361  2308.0   26
 2012-10-08 09:15:01         2308.0    1   53309040  2307.4   77  2308.6    9
 2012-10-08 09:15:01.500000  2308.2    1  124630140  2307.0  180  2308.4    1
 2012-10-08 09:15:02         2307.0    5   85846260  2308.2  124  2308.0    9
 2012-10-08 09:15:02.500000  2307.0    3  128073540  2307.0  185  2307.6   11
 ......
 2012-10-09 07:19:30            0.0    0          0  2276.6    0     0.0    0
 2012-10-09 09:14:00         2283.2   80   98634240  2283.2  144  2283.4    1
 2012-10-09 09:15:00         2285.2   18  126814260  2285.2  185  2285.6    3
 2012-10-09 09:15:01         2285.8    6   98719560  2286.8  144  2287.0   25
 2012-10-09 09:15:01.500000  2287.0   36  144759420  2288.8  211  2289.0    4
 2012-10-09 09:15:02         2287.4    6  109829280  2287.4  160  2288.6    5
 ......

I have a DataFrame containing several days of exchange trading data, as shown above. The data I want is from 9:00:00 AM - 11:30:00 AM and 13:00:00 - 15:15:00, so I would like to do two things:

  1. for each date in the DataFrame truncate to only have data in the range of 9:00:00AM - 11:30:00AM and 13:00:00 - 15:15:00
  2. within the ranges from 1., fill in missing data at a frequency of 500 milliseconds

The pandas truncate function only allows me to truncate by date, but I would like to truncate by datetime.time here. Also, how do I fill in the missing data only for the intervals I am interested in?

Thanks a lot.

tzelleke
tesla1060

1 Answer

  1. for each date in the DataFrame truncate to only have data in the range of 9:00:00AM - 11:30:00AM and 13:00:00 - 15:15:00

Use index slicing for that, e.g.:

df = df[start_timestamp:end_timestamp]
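
Slicing like that works with full timestamps. For selecting by time of day across all dates at once (which the question is really after), pandas also provides `DataFrame.between_time`. A small sketch on an illustrative frame (the index and values here are made up for demonstration):

```python
import numpy as np
import pandas as pd

# Illustrative frame: two full days at 1-minute intervals.
idx = pd.date_range("2012-10-08 09:00", periods=2 * 24 * 60, freq="1min")
df = pd.DataFrame({"price": np.arange(len(idx))}, index=idx)

# Slice by full timestamps (both endpoints inclusive):
sliced = df["2012-10-08 09:00:00":"2012-10-08 11:30:00"]

# Select a time-of-day window on every date in the index at once:
morning = df.between_time("09:00", "11:30")
```

`between_time` applies the time-of-day filter to every date in the index, so you do not need to know the dates in advance.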

  2. within the ranges from 1., fill in missing data at a frequency of 500 milliseconds

Generate a new DataFrame with an index at 500 ms intervals. Merge this DataFrame with the original one, keeping every row of the regular grid (the example below uses a left join on the filler frame; an outer join would additionally keep any original rows that fall off the grid). This gets you a DataFrame with rows at regular intervals, where rows for missing observations contain NaN values. Then fill the missing NaN values with fillna.

Example:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: data = pd.DataFrame({"value": np.arange(5)}, index=pd.date_range("2013/02/03", periods=5, freq="3Min"))

In [4]: data
Out[4]: 
                     value
2013-02-03 00:00:00      0
2013-02-03 00:03:00      1
2013-02-03 00:06:00      2
2013-02-03 00:09:00      3
2013-02-03 00:12:00      4

In [5]: filler = pd.DataFrame({"value": [100] * 15}, index=pd.date_range("2013/02/03", periods=15, freq="1Min"))                                                                           

In [6]: filler
Out[6]: 
                     value
2013-02-03 00:00:00    100
2013-02-03 00:01:00    100
2013-02-03 00:02:00    100
2013-02-03 00:03:00    100
2013-02-03 00:04:00    100
2013-02-03 00:05:00    100
2013-02-03 00:06:00    100
2013-02-03 00:07:00    100
2013-02-03 00:08:00    100
2013-02-03 00:09:00    100
2013-02-03 00:10:00    100
2013-02-03 00:11:00    100
2013-02-03 00:12:00    100
2013-02-03 00:13:00    100
2013-02-03 00:14:00    100

In [7]: merged = filler.merge(data, how='left', left_index=True, right_index=True)                                                                                                         

In [8]: merged["value"] = np.where(np.isfinite(merged.value_y), merged.value_y, merged.value_x)                                                                                            

In [9]: merged
Out[9]: 
                     value_x  value_y  value
2013-02-03 00:00:00      100        0      0
2013-02-03 00:01:00      100      NaN    100
2013-02-03 00:02:00      100      NaN    100
2013-02-03 00:03:00      100        1      1
2013-02-03 00:04:00      100      NaN    100
2013-02-03 00:05:00      100      NaN    100
2013-02-03 00:06:00      100        2      2
2013-02-03 00:07:00      100      NaN    100
2013-02-03 00:08:00      100      NaN    100
2013-02-03 00:09:00      100        3      3
2013-02-03 00:10:00      100      NaN    100
2013-02-03 00:11:00      100      NaN    100
2013-02-03 00:12:00      100        4      4
2013-02-03 00:13:00      100      NaN    100
2013-02-03 00:14:00      100      NaN    100

In [10]: merged['2013-02-03 00:01:00':'2013-02-03 00:10:00']                                                                                                                                
Out[10]: 
                     value_x  value_y  value
2013-02-03 00:01:00      100      NaN    100
2013-02-03 00:02:00      100      NaN    100
2013-02-03 00:03:00      100        1      1
2013-02-03 00:04:00      100      NaN    100
2013-02-03 00:05:00      100      NaN    100
2013-02-03 00:06:00      100        2      2
2013-02-03 00:07:00      100      NaN    100
2013-02-03 00:08:00      100      NaN    100
2013-02-03 00:09:00      100        3      3
2013-02-03 00:10:00      100      NaN    100
Maxim Egorushkin
  • thanks, as you can see my index is in the form of a complete timestamp like `2012-10-08 07:12:22`. Is your `[start_timestamp:end_timestamp]` in the form of datetime.time? If it is, it does not seem to work; it throws an invalid slicing exception. – tesla1060 Feb 03 '13 at 13:39
  • @tesla1060 added a complete example – Maxim Egorushkin Feb 03 '13 at 14:39
  • thanks for the example, that has solved my 2nd question. But for the first one, you handle it with `merged['2013-02-03 00:01:00':'2013-02-03 00:10:00']`, which assumes you know the date to be `2013-02-03`. My problem is that I have multiple dates, and on each date I would like the data from `00:01:00` to `00:10:00`. Is there an easier way to achieve that than specifying the full timestamp `['2013-02-03 00:01:00':'2013-02-03 00:10:00']`, maybe just using the datetime.time part, `['00:01:00':'00:10:00']`? – tesla1060 Feb 03 '13 at 14:46
  • @tesla1060 You could probably create a two-level index `['date','time']` and then apply time filtering on the second level, but that is beyond my current level of pandas-fu. – Maxim Egorushkin Feb 03 '13 at 15:13