2

I have a pandas dataframe in which each row represents a period of time. There is a start_date column, containing the datetime the period started, and an end_date column, containing the datetime the period ended.

Normally, I will query this dataset by filtering for rows whose start and end dates are relative to some other date. For example, if I want to filter for all rows where the period contains a certain date, I will do something like this:

df = df[(df['start_date' < my_date]) & ('end_date' > my_date)]

There are a variety of queries like this which are fine and non-problematic. I can do something similar to filter for periods which start between two dates, or end between two dates, or are enclosed by two dates. This is all fine and well and good.

I now have a different kind of date-related criteria I want to run: I want to find all rows whose periods enclose a certain day of the year, (e.g. April 6th).

So, how do I find all rows in this dataframe whose start and end dates enclose a date with a day value of 6 and a month value of 4?

The only way I can think of to do it is to construct a range of dates of April 6th, (1970-04-06, 1971-04-06... ) then run the above date comparison on all of them, and take the bitwise OR of all those comparisons to get the indices of periods that enclose one of those April 6th dates.

This would work, but it feels like a kludge. Is there a more elegant way of doing this task?

EDIT: By request, here is an example of start dates, end dates, and whether or not they contain an April 6th:

             start_date            end_date contains_april_6th
0  2012-07-10 23:06:07 2014-12-30 15:07:09               True
1  2011-10-21 13:56:20 2011-10-25 12:54:26              False
2  2011-04-23 03:25:00 2013-12-10 23:08:02               True
3  2014-08-10 01:14:03 2016-01-23 22:23:52               True
4  2017-04-23 17:15:43 2017-06-03 14:02:19              False
5  2012-10-01 01:42:00 2016-02-05 04:17:16               True
6  2015-08-21 19:59:45 2015-09-06 03:58:59              False
7  2016-06-10 23:03:14 2016-10-26 23:00:53              False
8  2011-08-12 20:50:34 2013-09-22 23:03:05               True
9  2012-11-04 00:45:24 2013-08-15 02:39:36               True
10 2016-03-28 07:51:55 2017-10-07 13:07:57               True
11 2010-08-03 06:42:04 2012-01-04 09:04:00               True
12 2010-05-11 01:22:27 2012-12-30 05:51:48               True
13 2011-12-30 01:19:28 2014-06-20 09:13:32               True
14 2011-11-12 18:04:15 2013-01-05 19:47:20               True
15 2014-04-13 10:40:25 2015-06-08 11:00:44               True
16 2012-10-23 05:43:02 2014-11-04 11:53:35               True
17 2011-03-18 05:00:50 2012-07-30 17:24:36               True
18 2010-04-24 10:00:09 2011-02-21 23:26:50              False
19 2012-02-09 08:05:07 2016-04-06 14:24:22               True
20 2012-04-05 08:39:50 2014-03-11 01:22:43               True
21 2011-07-30 19:19:32 2017-02-14 17:04:36               True
22 2010-01-02 23:48:45 2012-12-16 14:42:24               True
23 2011-11-08 23:49:27 2012-03-11 22:17:14              False
24 2013-02-25 02:27:50 2014-08-02 00:09:07               True
25 2010-07-11 18:56:22 2014-09-06 20:38:50               True
26 2012-11-23 02:08:11 2013-02-26 18:51:47              False
27 2011-05-31 09:02:02 2017-08-01 01:58:09               True
28 2012-10-06 10:09:32 2017-01-03 00:29:34               True
29 2015-10-18 22:37:40 2017-10-06 17:05:42               True
30 2014-06-08 12:10:40 2017-09-30 02:23:40               True
31 2014-06-15 09:33:27 2015-04-17 06:10:28               True
32 2015-06-04 07:32:59 2017-09-22 21:35:34               True
33 2012-09-05 22:24:04 2013-04-18 04:04:50               True
34 2012-07-09 10:44:32 2017-09-27 10:30:33               True
35 2013-01-16 11:50:53 2015-07-01 15:06:51               True
36 2012-05-12 23:14:08 2016-03-31 02:08:50               True
37 2013-02-12 21:21:58 2015-03-13 01:36:45               True
38 2012-08-12 05:07:33 2016-04-21 06:45:48               True
39 2012-09-08 20:40:18 2013-08-12 22:05:30               True
40 2010-10-20 02:05:23 2011-08-18 06:37:04               True
41 2010-09-27 07:44:21 2012-11-25 04:20:47               True
42 2014-09-04 22:27:01 2015-05-18 21:44:14               True
43 2012-01-06 15:35:33 2016-03-02 00:46:20               True
44 2013-09-06 11:39:50 2017-08-21 07:39:03               True
45 2011-01-01 00:58:45 2014-03-22 03:40:25               True
46 2015-07-03 04:02:56 2016-02-16 20:34:17              False
47 2010-06-20 15:24:54 2015-11-17 08:05:10               True
48 2011-06-24 02:39:13 2015-01-13 18:28:24               True
49 2011-09-30 06:53:27 2011-11-13 12:11:41              False
50 2015-10-11 03:13:35 2016-06-29 01:00:01               True
R Hill
  • 1,744
  • 1
  • 22
  • 35

2 Answers2

2

I think you are looking for checking if a date without any year mentioned is between certain time. You can create an interval index then check if the year is in between i.e (Reference)

# A custom function to return true of certain date is in between range
def get_date(x):
    le = x.name.left.year
    ri = x.name.right.year
    # Create a date based on the interval years 
    rng = pd.to_datetime(['04-06-{}'.format(str(i)) for i in range(le-1,ri+1)])
    # Return true if any of the above date is present in the interval
    m = any([i in x.name for i in rng])
    return m

df.index = pd.IntervalIndex.from_arrays(df['start_date'],df['end_date'])

df['new'] = df.apply(get_date,1)

Output :

print(df.reset_index(drop=True))
         start_date            end_date  contains_april_6th    new
0 2012-07-10 23:06:07 2014-12-30 15:07:09                True   True
1 2011-10-21 13:56:20 2011-10-25 12:54:26               False  False
2 2011-04-23 03:25:00 2013-12-10 23:08:02                True   True
3 2014-08-10 01:14:03 2016-01-23 22:23:52                True   True
4 2017-04-23 17:15:43 2017-06-03 14:02:19               False  False
5 2012-10-01 01:42:00 2016-02-05 04:17:16                True   True
6 2015-08-21 19:59:45 2015-09-06 03:58:59               False  False
7 2016-06-10 23:03:14 2016-10-26 23:00:53               False  False
8 2011-08-12 20:50:34 2013-09-22 23:03:05                True   True
9 2012-11-04 00:45:24 2013-08-15 02:39:36                True   True

I would go for @jezreal's solution. Here I used apply which is really time consuming as the dataframe's length increases.

Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
2

I think you need np.logical_or and reduce with all masks created by list comprehension:

#get minimal and max years
stacked = df[['start_date','end_date']].stack()
min_date = stacked.min().year
max_date = stacked.max().year
print (min_date)
print (max_date)
2010
2017

#for each date create mask
masks = [(df['start_date'] < pd.Timestamp(x,4,6)) & 
         (df['end_date'] > pd.Timestamp(x,4,6)) for x in range(min_date, max_date + 1)]

#reduce all masks with OR
df['new'] = np.logical_or.reduce(masks)

print (df)
            start_date            end_date  contains_april_6th    new
0  2012-07-10 23:06:07 2014-12-30 15:07:09                True   True
1  2011-10-21 13:56:20 2011-10-25 12:54:26               False  False
2  2011-04-23 03:25:00 2013-12-10 23:08:02                True   True
3  2014-08-10 01:14:03 2016-01-23 22:23:52                True   True
4  2017-04-23 17:15:43 2017-06-03 14:02:19               False  False
5  2012-10-01 01:42:00 2016-02-05 04:17:16                True   True
6  2015-08-21 19:59:45 2015-09-06 03:58:59               False  False
7  2016-06-10 23:03:14 2016-10-26 23:00:53               False  False
8  2011-08-12 20:50:34 2013-09-22 23:03:05                True   True
9  2012-11-04 00:45:24 2013-08-15 02:39:36                True   True
10 2016-03-28 07:51:55 2017-10-07 13:07:57                True   True
11 2010-08-03 06:42:04 2012-01-04 09:04:00                True   True
12 2010-05-11 01:22:27 2012-12-30 05:51:48                True   True
13 2011-12-30 01:19:28 2014-06-20 09:13:32                True   True
14 2011-11-12 18:04:15 2013-01-05 19:47:20                True   True
15 2014-04-13 10:40:25 2015-06-08 11:00:44                True   True
16 2012-10-23 05:43:02 2014-11-04 11:53:35                True   True
17 2011-03-18 05:00:50 2012-07-30 17:24:36                True   True
18 2010-04-24 10:00:09 2011-02-21 23:26:50               False  False
19 2012-02-09 08:05:07 2016-04-06 14:24:22                True   True
20 2012-04-05 08:39:50 2014-03-11 01:22:43                True   True
21 2011-07-30 19:19:32 2017-02-14 17:04:36                True   True
22 2010-01-02 23:48:45 2012-12-16 14:42:24                True   True
23 2011-11-08 23:49:27 2012-03-11 22:17:14               False  False
24 2013-02-25 02:27:50 2014-08-02 00:09:07                True   True
25 2010-07-11 18:56:22 2014-09-06 20:38:50                True   True
26 2012-11-23 02:08:11 2013-02-26 18:51:47               False  False
27 2011-05-31 09:02:02 2017-08-01 01:58:09                True   True
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Each of the intervals you test in this example do include an April 6th. They all include April 6th 2016. I do appreciate that I haven't provided test cases in my original question (I am now putting some together and will edit the original question to include them), but I don't think this method is salvageable. – R Hill Oct 30 '17 at 13:46
  • Really interesting question, sorry, I first dont understand it ;) – jezrael Oct 30 '17 at 14:14
  • 1
    This is the same solution as I propose at the end of the question, but it's a very concise way of doing so which I wasn't aware of. I'll leave the question running for a few days to see if my mythical more-elegant-solution comes along, but if not, I'll mark this as the accepted answer. – R Hill Oct 30 '17 at 14:14
  • When did you update? Sir I was working on my solution so much. Anyway I upvoted you answer when it came. – Bharath M Shetty Oct 30 '17 at 14:34
  • Me too ;) 2 hours min ;) – jezrael Oct 30 '17 at 14:35
  • What do you think of my solution ? – Bharath M Shetty Oct 30 '17 at 14:36
  • 1
    I think my solution, I am working 2 hours for it minimal ;) – jezrael Oct 30 '17 at 14:36
  • 1
    Oh its been an hour for me. Time runs so fast when indulged in solving a problem. – Bharath M Shetty Oct 30 '17 at 14:38
  • 1
    Your solution is super dooper fast. I used apply its damn slow. – Bharath M Shetty Oct 30 '17 at 14:43