
I am new to pandas. I have a dataframe with two columns: dt (a date-time stamp) and value.

Given start and end date-time stamps: is there an easy way to create a new dataframe from the original one that contains only the rows between the two date-time stamps?

                dt    value
84    7/23/2014 7:00  0.300
85    7/23/2014 7:05  0.300
86    7/23/2014 7:10  0.312
87    7/23/2014 7:15  0.300
88    7/23/2014 7:20  0.300
89    7/23/2014 7:25  0.300
90    7/23/2014 7:30  0.300
91    7/23/2014 7:35  0.300
92    7/23/2014 7:40  0.300
93    7/23/2014 7:45  0.216
94    7/23/2014 7:50  0.204
95    7/23/2014 7:55  0.228
96    7/23/2014 8:00  0.228
97    7/23/2014 8:05  0.228
98    7/23/2014 8:10  0.228
99    7/23/2014 8:15  0.240
100   7/23/2014 8:20  0.228
101   7/23/2014 8:25  0.216
102   7/23/2014 8:30  0.228
103   7/23/2014 8:35  0.324
104   7/23/2014 8:40  0.336
105   7/23/2014 8:45  0.324
106   7/23/2014 8:50  0.324
107   7/23/2014 8:55  0.324
108   7/23/2014 9:00  0.252
109   7/23/2014 9:05  0.252
110   7/23/2014 9:10  0.240
111   7/23/2014 9:15  0.240
112   7/23/2014 9:20  0.252
113   7/23/2014 9:25  0.240
..               ...    ...
198  7/23/2014 16:30  0.132
199  7/23/2014 16:35  0.120
200  7/23/2014 16:40  0.180
201  7/23/2014 16:45  0.216
202  7/23/2014 16:50  0.204
203  7/23/2014 16:55  0.192

3 Answers


So long as dt is already a datetime dtype you can filter using date strings; if not, you can convert it first:

df['dt'] = pd.to_datetime(df['dt'])

Then filter:

In [115]:

df[(df['dt'] > '2014-07-23 07:30:00') & (df['dt'] < '2014-07-23 09:00:00')]
Out[115]:
                       dt  value
index                           
91    2014-07-23 07:35:00  0.300
92    2014-07-23 07:40:00  0.300
93    2014-07-23 07:45:00  0.216
94    2014-07-23 07:50:00  0.204
95    2014-07-23 07:55:00  0.228
96    2014-07-23 08:00:00  0.228
97    2014-07-23 08:05:00  0.228
98    2014-07-23 08:10:00  0.228
99    2014-07-23 08:15:00  0.240
100   2014-07-23 08:20:00  0.228
101   2014-07-23 08:25:00  0.216
102   2014-07-23 08:30:00  0.228
103   2014-07-23 08:35:00  0.324
104   2014-07-23 08:40:00  0.336
105   2014-07-23 08:45:00  0.324
106   2014-07-23 08:50:00  0.324
107   2014-07-23 08:55:00  0.324
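As a variation on the boolean-mask approach above, `Series.between` expresses the same range condition in one call. A minimal sketch with made-up data shaped like the question's (note that `between` is inclusive on both ends by default, unlike the strict `>`/`<` comparisons above):

```python
import pandas as pd

# illustrative data in the same shape as the question's dataframe
df = pd.DataFrame({
    "dt": pd.date_range("2014-07-23 07:00", periods=6, freq="5min"),
    "value": [0.300, 0.300, 0.312, 0.300, 0.300, 0.300],
})

# keep rows whose timestamp falls in the window (inclusive on both ends)
mask = df["dt"].between("2014-07-23 07:05", "2014-07-23 07:15")
sub = df[mask]
print(len(sub))  # 3 rows: 07:05, 07:10, 07:15
```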
EdChum

You can also use query:

In [25]: df.query('"2014-07-23 07:55:00" <= dt <= "2014-07-23 08:20:00"')
Out[25]: 
                     dt  value
95  2014-07-23 07:55:00  0.228
96  2014-07-23 08:00:00  0.228
97  2014-07-23 08:05:00  0.228
98  2014-07-23 08:10:00  0.228
99  2014-07-23 08:15:00  0.240
100 2014-07-23 08:20:00  0.228
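Rather than embedding the timestamps as string literals, `query` can also reference local Python variables with the `@` prefix, which keeps the boundaries in one place. A minimal sketch with illustrative data:

```python
import pandas as pd

# illustrative data in the same shape as the question's dataframe
df = pd.DataFrame({
    "dt": pd.date_range("2014-07-23 07:55", periods=6, freq="5min"),
    "value": [0.228, 0.228, 0.228, 0.228, 0.240, 0.228],
})

start = pd.Timestamp("2014-07-23 08:00")
end = pd.Timestamp("2014-07-23 08:10")

# @name pulls the value from the enclosing Python scope
sub = df.query("@start <= dt <= @end")
print(len(sub))  # 3 rows: 08:00, 08:05, 08:10
```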
rachwa
  • I got a small problem: "Cannot compare tz-naive and tz-aware datetime-like objects", but I solved it as follows: df.query('"2023-04-01 00:00:00.000Z" <= dt < "2024-10-01 00:00:00.000Z"'). – R13mus Jul 19 '23 at 12:02

The answer above is right, but for people who, just like me, stumble upon this question more than 5 years after it was posted, I want to add this remark.

If you want to filter on a sorted column (and timestamps tend to be sorted), it is more efficient to use the searchsorted method of a pandas Series, which runs in O(log n) instead of O(n).

The example below results in a difference of much more than a factor of 1000. This difference can be made arbitrarily large due to the difference in complexity, of course, but the chosen numbers are the ones I was using when I stumbled upon this question.

import pandas as pd
import numpy as np
import timeit

N = 500000  # number of rows
M = 200     # number of columns

# column0 is sorted ascending; the remaining columns are random noise
data = np.hstack([np.arange(0., N).reshape(N, 1), np.random.randn(N, M - 1)])
df = pd.DataFrame(data, columns=["column" + str(i) for i in range(M)])

def return_first(df):
    # boolean mask: scans the whole column, O(n)
    return df[(df['column0'] > 100.5) & (df['column0'] < 400000.5)]

def return_second(df):
    # binary search on the sorted column, O(log n)
    t1 = df['column0'].searchsorted(100.5)
    t2 = df['column0'].searchsorted(400000.5)
    return df.loc[t1:t2 - 1]


if __name__ == '__main__':
    t = timeit.timeit(lambda: return_first(df), number=100)
    print(t)
    t = timeit.timeit(lambda: return_second(df), number=100)
    print(t)

results:

59.1751627
0.015401400000001786
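The benchmark above uses a sorted float column; the same idea carries over directly to a sorted datetime column, since searchsorted accepts Timestamps. A minimal sketch with illustrative data (it assumes the dt column is sorted ascending, which searchsorted requires):

```python
import pandas as pd

# a small sorted datetime column, one value every 5 minutes
dt = pd.Series(pd.date_range("2014-07-23 07:00", periods=12, freq="5min"))

# binary search for the window boundaries:
lo = dt.searchsorted(pd.Timestamp("2014-07-23 07:30"))                 # first index >= start
hi = dt.searchsorted(pd.Timestamp("2014-07-23 07:45"), side="right")   # one past last index <= end

window = dt.iloc[lo:hi]
print(window.iloc[0], window.iloc[-1])  # 07:30 ... 07:45
```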
  • I stumbled on this answer! I have a sorted dataframe and this is the kind of thing I was looking for. Is there an easy way to filter based on minutes/hours? Let's say I want to get everything that happened on the hour mark, so the time is X:00:00. – NewGuyInJava Jun 14 '21 at 21:08
    If I understand the question correctly, that depends on the data. If it is very structured, like the one in the question, you can do something like df.loc[t1::288] (where t1 is the first mark and there are 288 5-minute intervals in a day). If the data is not that structured it becomes harder, and maybe a binary-search-style algorithm after a good guess can be beneficial. But in the case of very unstructured data it becomes harder and harder to gain much efficiency over the first answer. – DwightFromTheOffice Jun 15 '21 at 07:33
  • Thanks! I ended up doing something like df.opentime.dt.minute == 0 – NewGuyInJava Jun 17 '21 at 23:04
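The commenter's approach can be sketched as follows: the `.dt` accessor exposes the minute component of each timestamp, so a boolean mask keeps only the on-the-hour rows (illustrative data; the column name `dt` matches the question, not the commenter's `opentime`):

```python
import pandas as pd

# illustrative data: every 5 minutes from 07:00 to 08:55
df = pd.DataFrame({"dt": pd.date_range("2014-07-23 07:00", periods=24, freq="5min")})

# keep only rows whose time is X:00:00
on_the_hour = df[df["dt"].dt.minute == 0]
print(len(on_the_hour))  # 2 rows: 07:00 and 08:00
```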