0

I have a table like following

df = pandas.DataFrame(
[[1,datetime.datetime(2018, 1, 1), datetime.datetime(2019, 1, 1),'Joe'], 
 [2,datetime.datetime(1999, 1, 1), datetime.datetime(2019, 1, 1),'James'],
 [3,datetime.datetime(1980, 1, 1), datetime.datetime(2000, 1, 1),'Jack'],
 [4,datetime.datetime(1967, 1, 1), datetime.datetime(1975, 1, 1),'Jim']],
columns=['PERSON ID', 'START DATE', 'END DATE','NAME']

I want get a list of colleagues and count them for each person based on their start and end date. Expected output is

    Name Number_of_colleagues List_of_colleagues
    Joe 1 [James]
    James 2 [Joe, Jack]
    Jack 1 [James]
    Jim 0 []

Any recommendation how to do it? I have tried to use nested for loop to iterate over each row to find it. It works but is really slow for 20000 rows.

Prithis
  • 1,033
  • 4
  • 14
  • 22
  • what's the aggregate conditions? I don't fully understand the output condition – gold_cy Dec 01 '19 at 20:55
  • For each employee if their time spent in the office indicated by the start date and end date overlaps with any other employee then they are colleagues. In such case increase the count and add the name to the list of colleagues. – Prithis Dec 01 '19 at 21:01
  • you can sort by start date and for each row you stop checking when the start date of the next row is after the end date of the current row – Diego Marin Santos Dec 01 '19 at 21:05
  • Does this answer your question? [Find date range overlap in python](https://stackoverflow.com/questions/42462218/find-date-range-overlap-in-python) – AMC Dec 01 '19 at 21:14

1 Answers1

1

With pandas 0.25.3 you can use pd.IntervalIndex.overlaps:

df.index = pd.IntervalIndex.from_arrays(df['START DATE'], df['END DATE'])

df['Num of Colleagues'] = [sum(df.index.overlaps(i))-1 for i in df.index]
print(df)

Output:

                          PERSON ID START DATE   END DATE   NAME  Num of Colleagues
(2018-01-01, 2019-01-01]          1 2018-01-01 2019-01-01    Joe                  1
(1999-01-01, 2019-01-01]          2 1999-01-01 2019-01-01  James                  2
(1980-01-01, 2000-01-01]          3 1980-01-01 2000-01-01   Jack                  1
(1967-01-01, 1975-01-01]          4 1967-01-01 1975-01-01    Jim                  0

And, we can add a list of colleauges like this:

df['List of colleagues'] = [df.loc[df.index.overlaps(i), 'NAME'].tolist() for i in df.index]

Output:

                          PERSON ID START DATE   END DATE   NAME  Num of Colleagues  List of colleagues
(2018-01-01, 2019-01-01]          1 2018-01-01 2019-01-01    Joe                  1        [Joe, James]
(1999-01-01, 2019-01-01]          2 1999-01-01 2019-01-01  James                  2  [Joe, James, Jack]
(1980-01-01, 2000-01-01]          3 1980-01-01 2000-01-01   Jack                  1       [James, Jack]
(1967-01-01, 1975-01-01]          4 1967-01-01 1975-01-01    Jim                  0               [Jim]

And, eliminated "self" on that row:

df['List of colleagues'] = [df.loc[df.index.overlaps(i) & (df.index != i), 'NAME'].tolist()  for i in df.index]

Output:

                          PERSON ID START DATE   END DATE   NAME  Num of Colleagues List of colleagues
(2018-01-01, 2019-01-01]          1 2018-01-01 2019-01-01    Joe                  1            [James]
(1999-01-01, 2019-01-01]          2 1999-01-01 2019-01-01  James                  2        [Joe, Jack]
(1980-01-01, 2000-01-01]          3 1980-01-01 2000-01-01   Jack                  1            [James]
(1967-01-01, 1975-01-01]          4 1967-01-01 1975-01-01    Jim                  0                 []
Scott Boston
  • 147,308
  • 15
  • 139
  • 187