3

I need to run a function on a large groupby query that checks whether two subGroups have any overlapping dates. Below is an example of a single group tmp:

    ID  num start       stop        subGroup
0   21  10  2006-10-10  2008-10-03  1
1   21  46  2006-10-10  2100-01-01  2
2   21  5   1997-11-25  1998-09-29  1
3   21  42  1998-09-29  2100-01-01  2
4   21  3   1997-01-07  1997-11-25  1
5   21  6   2006-10-10  2008-10-03  1
6   21  47  1998-09-29  2006-10-10  2
7   21  4   1997-01-07  1998-09-29  1

The function I wrote to do this looks like this:

def hasOverlap(tmp):
    d2_starts = tmp[tmp['subGroup']==2]['start']
    d2_stops = tmp[tmp['subGroup']==2]['stop']
    return tmp[tmp['subGroup']==1].apply(lambda row_d1:
         (
            #Check for part nested D2 in D1
            ((d2_starts >= row_d1['start']) &
             (d2_starts < row_d1['stop']) ) |
            ((d2_stops >= row_d1['start']) &
             (d2_stops < row_d1['stop']) ) |
            #Check for fully nested D1 in D2
            ((d2_stops >= row_d1['stop']) &
             (d2_starts <= row_d1['start']) )
         ).any()
         ,axis = 1
        ).any()

The problem is that this code has many redundancies and when I run the query:

groups.agg(hasOverlap)

It takes an unreasonably long time to terminate.

Are there any performance fixes (such as using built-in functions or set_index) that I could do to speed this up?

TimY
  • 5,256
  • 5
  • 44
  • 57

1 Answers1

0

Are you just looking to return "True" or "False" based on the presence of an overlap? If so, I'd just get a list of the dates for each subgroup, and then uses pandas isin method to check if they overlap.

You could try something like this:

#split subgroups into separate DF's
group1 = groups[groups.subgroup==1]
group2 = groups[groups.subgroup==2]

#check if any of the start dates from group 2 are in group 1
if len(group1[group1.start.isin(list(group2.start))]) >0:
    print "Group1 overlaps group2"

#check if any of the start dates from group 1 are in group 2
if len(group2[group2.start.isin(list(group1.start))]) >0:
    print "Group2 overlaps group1"
Sam
  • 4,000
  • 20
  • 27