I need to run a function on a large groupby
query that checks whether two subGroups have any overlapping dates. Below is an example of a single group tmp
:
ID num start stop subGroup
0 21 10 2006-10-10 2008-10-03 1
1 21 46 2006-10-10 2100-01-01 2
2 21 5 1997-11-25 1998-09-29 1
3 21 42 1998-09-29 2100-01-01 2
4 21 3 1997-01-07 1997-11-25 1
5 21 6 2006-10-10 2008-10-03 1
6 21 47 1998-09-29 2006-10-10 2
7 21 4 1997-01-07 1998-09-29 1
The function I wrote to do this looks like this:
def hasOverlap(tmp):
d2_starts = tmp[tmp['subGroup']==2]['start']
d2_stops = tmp[tmp['subGroup']==2]['stop']
return tmp[tmp['subGroup']==1].apply(lambda row_d1:
(
#Check for part nested D2 in D1
((d2_starts >= row_d1['start']) &
(d2_starts < row_d1['stop']) ) |
((d2_stops >= row_d1['start']) &
(d2_stops < row_d1['stop']) ) |
#Check for fully nested D1 in D2
((d2_stops >= row_d1['stop']) &
(d2_starts <= row_d1['start']) )
).any()
,axis = 1
).any()
The problem is that this code has many redundancies and when I run the query:
groups.agg(hasOverlap)
It takes an unreasonably long time to terminate.
Are there any performance fixes (such as using built-in functions or set_index) that I could do to speed this up?