Pandas: Count time interval intersections over a group by

Question

I have a dataframe of the following form

import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4,5],
                   'group':['A','A','A','B','B'],
                   'start':['2012-08-19','2012-08-22','2013-08-19','2012-08-19','2013-08-19'],
                   'end':['2012-08-28','2013-09-13','2013-08-19','2012-12-19','2014-08-19']})

Out[1]:
     id group       start         end
0   1     A  2012-08-19  2012-08-28
1   2     A  2012-08-22  2013-09-13
2   3     A  2013-08-19  2013-08-21
3   4     B  2012-08-19  2012-12-19
4   5     B  2013-08-19  2014-08-19

For a given row in my dataframe, I'd like to count the number of other items in the same group that have an overlapping time interval.

For example, in group A, id 2 ranges from 22 August 2012 to 13 Sept 2013 and hence overlaps with both id 1 (19 August 2012 to 28 August 2012) and id 3 (19 August 2013 to 21 August 2013), for a count of 2.

Conversely, there is no overlap between the items in group B.

So for my example dataframe above, I'd like to produce something like

Out[2]:
   id group       start         end  count
0   1     A  2012-08-19  2012-08-28      1
1   2     A  2012-08-22  2013-09-13      2
2   3     A  2013-08-19  2013-08-21      1
3   4     B  2012-08-19  2012-12-19      0
4   5     B  2013-08-19  2014-08-19      0

I could "brute-force" this but I'd like to know if there is a more efficient Pandas way of getting this done.

Thanks in advance for your help

Can you elaborate a bit on what you mean by `intersecting time interval`? I mean, explain how you got the count. – Bharath M Shetty, Oct 19 '17 at 15:35

Andy Hayden · Answer 1 · 2017-10-19T16:35:26.653

So, I would see how brute force fares... if it's slow I'd cythonize this logic. Although it is O(M^2) in the group size, it might not be so bad if there are lots of small groups.

In [10]: import numpy as np

In [11]: def interval_overlaps(a, b):
    ...:     return min(a["end"], b["end"]) - max(a["start"], b["start"]) > np.timedelta64(-1)


In [12]: def count_overlaps(df1):
    ...:     return sum(interval_overlaps(df1.iloc[i], df1.iloc[j]) for i in range(len(df1)) for j in range(i + 1, len(df1)))

In [13]: df.groupby("group").apply(count_overlaps)
Out[13]:
group
A    2
B    0
dtype: int64

The former (interval_overlaps) is a tweak of this interval overlap function.
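
As a minimal sketch of the test that interval_overlaps performs, assuming closed intervals with start <= end: two intervals overlap exactly when the later of the two starts does not come after the earlier of the two ends. The helper name closed_intervals_overlap is hypothetical, and the dates are those of ids 1, 2, 4 and 5 from the question.

```python
import pandas as pd


def closed_intervals_overlap(start_a, end_a, start_b, end_b):
    # Hypothetical helper: overlap exists when the later start
    # is not after the earlier end (shared endpoints count as overlap).
    return max(start_a, start_b) <= min(end_a, end_b)


id1 = (pd.Timestamp("2012-08-19"), pd.Timestamp("2012-08-28"))
id2 = (pd.Timestamp("2012-08-22"), pd.Timestamp("2013-09-13"))
id4 = (pd.Timestamp("2012-08-19"), pd.Timestamp("2012-12-19"))
id5 = (pd.Timestamp("2013-08-19"), pd.Timestamp("2014-08-19"))

overlap_group_a = closed_intervals_overlap(*id1, *id2)  # ids 1 and 2 overlap
overlap_group_b = closed_intervals_overlap(*id4, *id5)  # ids 4 and 5 do not
print(overlap_group_a, overlap_group_b)
```

The subtraction form used in interval_overlaps expresses the same idea as a non-negative difference between the earlier end and the later start.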

Edit: Upon re-reading, it looks like the count should be per-row rather than per-group, so the function passed to apply should be more like:

In [21]: def count_overlaps(df1):
    ...:     return pd.Series([df1.apply(lambda x: interval_overlaps(x, df1.iloc[i]), axis=1).sum() - 1 for i in range(len(df1))], df1.index)

In [22]: df.groupby("group").apply(count_overlaps)
Out[22]:
group
A      0    1
       1    2
       2    1
B      3    0
       4    0
dtype: int64

In [22]: df["count"] = df.groupby("group").apply(count_overlaps).values

In [23]: df
Out[23]:
         end group  id      start  count
0 2012-08-28     A   1 2012-08-19      1
1 2013-09-13     A   2 2012-08-22      2
2 2013-08-19     A   3 2013-08-19      1
3 2012-12-19     B   4 2012-08-19      0
4 2014-08-19     B   5 2013-08-19      0

I'm not sure why, but I was unable to write this apply as a transform... (I think that it's not taking the fast path as it's a python object rather than a numpy/pandas one) — Andy Hayden, Oct 19 '17 at 16:23
@Bharathshetty I think I misread the question; the OP's example doesn't necessitate a transform. But if we were counting the overlaps within each group, a transform would spread that information back into the original DataFrame, which is often a nice place to have it. – Andy Hayden, Oct 19 '17 at 16:26

score 2 · Answer 2 · answered Oct 19 '17 at 15:56

"brute-force"ish but gets the job done:

First I converted the date strings to dates, and then compared each row against the df with an apply.

df.start = pd.to_datetime(df.start)
df.end = pd.to_datetime(df.end)

df['count'] = df.apply(lambda row: len(df[ ( ( (row.start <= df.start) & (df.start <= row.end) ) \
                                            | ( (df.start <= row.start) & (row.start <= df.end) ) )
                           & (row.id != df.id) & (row.group == df.group) ]),axis=1)
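
A small check of why the <= comparisons in this filter would also work on the raw strings: dates written as YYYY-MM-DD sort lexicographically in the same order as chronologically. This is only a sketch on three start values from the question; approaches that subtract dates still need real datetimes.

```python
import pandas as pd

starts = pd.Series(["2012-08-19", "2012-08-22", "2013-08-19"])

# Lexicographic comparison on the raw ISO-formatted strings.
as_strings = (starts <= "2012-08-28").tolist()

# Chronological comparison after converting to datetimes.
as_dates = (pd.to_datetime(starts) <= pd.Timestamp("2012-08-28")).tolist()

print(as_strings == as_dates)
```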

Nathan H

Note: the to_datetime should not be necessary with your format; it was just habit to convert those. – Nathan H Oct 19 '17 at 16:00
Good one, I was thinking the same, though it needs so many conditions. – Bharath M Shetty Oct 19 '17 at 16:00
You could get rid of "& (row.id != df.id)" and do "len(...)-1", but I'm not sure how we could reduce the intersection conditions within an apply. – Nathan H Oct 19 '17 at 16:08
You may try groupby apply. My head's messed up now. I couldn't think much about the intersection of ranges. One-vs-all always hinders my thinking. – Bharath M Shetty Oct 19 '17 at 16:10
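
One way to reduce the intersection conditions, sketched here under the assumption that every interval has start <= end: two closed intervals overlap exactly when each one starts no later than the other ends. That single condition can be evaluated for all pairs in a group at once with NumPy broadcasting. The helper name overlap_counts is hypothetical, and the M x M boolean matrix still costs O(M^2) memory per group.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "group": ["A", "A", "A", "B", "B"],
    "start": pd.to_datetime(["2012-08-19", "2012-08-22", "2013-08-19",
                             "2012-08-19", "2013-08-19"]),
    "end": pd.to_datetime(["2012-08-28", "2013-09-13", "2013-08-19",
                           "2012-12-19", "2014-08-19"]),
})


def overlap_counts(g):
    # Hypothetical helper: count, for each row of one group, how many
    # other rows of that group have an overlapping closed interval.
    start = g["start"].to_numpy()
    end = g["end"].to_numpy()
    # overlaps[i, j] is True when interval i and interval j intersect.
    overlaps = (start[:, None] <= end[None, :]) & (start[None, :] <= end[:, None])
    # Every interval overlaps itself, so subtract 1 from each row sum.
    return pd.Series(overlaps.sum(axis=1) - 1, index=g.index)


# Concatenate the per-group results; assignment aligns on the original index.
counts = pd.concat([overlap_counts(g) for _, g in df.groupby("group")])
df["count"] = counts
print(df)
```

Looping over the groups explicitly and concatenating keeps the result indexed like the original frame, so no reshaping is needed before assigning the new column.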

score 1 · Answer 3 · edited Oct 19 '17 at 16:49

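import datetime

def ol(a, b):
    # a is one (start, end) pair; b is the list of all (start, end) pairs in the same group.
    # Returns how many pairs in b overlap a (a itself is in b, so it is counted too).
    l=[]
    for x in b:
        # Overlap exists when min(ends) - max(starts) is non-negative.
        l.append(int(min(a[1], x[1]) - max(a[0], x[0])>=datetime.timedelta(minutes=0)))
    return sum(l)


# Pair up each row's start and end, then attach the list of all pairs in the row's group.
df['New']=list(zip(df.start,df.end))
df['New2']=df.group.map(df.groupby('group').New.apply(list))
# Subtract 1 because each row always overlaps itself.
df.apply(lambda x : ol(x.New,x.New2),axis=1)-1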

Out[495]: 
0    1
1    2
2    1
3    0
4    0
dtype: int64

Timings

#My method 
df.apply(lambda x : ol(x.New,x.New2),axis=1)-1

100 loops, best of 3: 5.39 ms per loop

#@Andy's Method 
df.groupby("group").apply(count_overlaps)    
10 loops, best of 3: 23.5 ms per loop

#@Nathan's Method

df.apply(lambda row: len(df[ ( ( (row.start <= df.start) & (df.start <= row.end) ) \
                       | ( (df.start <= row.start) & (row.start <= df.end) ) )
                       & (row.id != df.id) & (row.group == df.group) ]),axis=1)

10 loops, best of 3: 25.8 ms per loop
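
The timings above appear to come from IPython's %timeit magic. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. The wrapper name nathan_counts and the loop count are arbitrary choices, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "group": ["A", "A", "A", "B", "B"],
    "start": pd.to_datetime(["2012-08-19", "2012-08-22", "2013-08-19",
                             "2012-08-19", "2013-08-19"]),
    "end": pd.to_datetime(["2012-08-28", "2013-09-13", "2013-08-19",
                           "2012-12-19", "2014-08-19"]),
})


def nathan_counts(frame):
    # Same row-wise filter as "Nathan's Method", wrapped in a function for timing.
    def one_row(row):
        starts_inside = (row["start"] <= frame["start"]) & (frame["start"] <= row["end"])
        contains_start = (frame["start"] <= row["start"]) & (row["start"] <= frame["end"])
        others = (row["id"] != frame["id"]) & (row["group"] == frame["group"])
        return len(frame[(starts_inside | contains_start) & others])

    return frame.apply(one_row, axis=1)


loops = 10  # arbitrary; raise it for more stable numbers
seconds = timeit.timeit(lambda: nathan_counts(df), number=loops)
print(f"{seconds / loops * 1e3:.2f} ms per loop")
```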

edited Oct 19 '17 at 16:49

Bharath M Shetty

answered Oct 19 '17 at 16:19

BENY

You took it far. I was using zip inside apply and I messed up. This is neat. – Bharath M Shetty Oct 19 '17 at 16:21
@Bharathshetty when I use `apply` I always do all the preparation outside the `lambda`, because I know I will mess it up otherwise... – BENY Oct 19 '17 at 16:23
You know yours is still the fastest. You can add a timeit. If you use numba it will be even faster. – Bharath M Shetty Oct 19 '17 at 16:40
@Bharathshetty man, it seems like you already did the timing tests, so feel free to modify it :-) – BENY Oct 19 '17 at 16:44
