For each row in Dataframe test all other rows for a condition

Question

I have a table like following

df = pandas.DataFrame(
[[1,datetime.datetime(2018, 1, 1), datetime.datetime(2019, 1, 1),'Joe'], 
 [2,datetime.datetime(1999, 1, 1), datetime.datetime(2019, 1, 1),'James'],
 [3,datetime.datetime(1980, 1, 1), datetime.datetime(2000, 1, 1),'Jack'],
 [4,datetime.datetime(1967, 1, 1), datetime.datetime(1975, 1, 1),'Jim']],
columns=['PERSON ID', 'START DATE', 'END DATE','NAME']

I want get a list of colleagues and count them for each person based on their start and end date. Expected output is

    Name Number_of_colleagues List_of_colleagues
    Joe 1 [James]
    James 2 [Joe, Jack]
    Jack 1 [James]
    Jim 0 []

Any recommendation how to do it? I have tried to use nested for loop to iterate over each row to find it. It works but is really slow for 20000 rows.

what's the aggregate conditions? I don't fully understand the output condition — gold_cy, Dec 01 '19 at 20:55
For each employee if their time spent in the office indicated by the start date and end date overlaps with any other employee then they are colleagues. In such case increase the count and add the name to the list of colleagues. — Prithis, Dec 01 '19 at 21:01
you can sort by start date and for each row you stop checking when the start date of the next row is after the end date of the current row — Diego Marin Santos, Dec 01 '19 at 21:05
Does this answer your question? [Find date range overlap in python](https://stackoverflow.com/questions/42462218/find-date-range-overlap-in-python) — AMC, Dec 01 '19 at 21:14

Scott Boston · Answer 1 · 2019-12-02T01:41:36.497

With pandas 0.25.3 you can use pd.IntervalIndex.overlaps:

df.index = pd.IntervalIndex.from_arrays(df['START DATE'], df['END DATE'])

df['Num of Colleagues'] = [sum(df.index.overlaps(i))-1 for i in df.index]
print(df)

Output:

                          PERSON ID START DATE   END DATE   NAME  Num of Colleagues
(2018-01-01, 2019-01-01]          1 2018-01-01 2019-01-01    Joe                  1
(1999-01-01, 2019-01-01]          2 1999-01-01 2019-01-01  James                  2
(1980-01-01, 2000-01-01]          3 1980-01-01 2000-01-01   Jack                  1
(1967-01-01, 1975-01-01]          4 1967-01-01 1975-01-01    Jim                  0

And, we can add a list of colleauges like this:

df['List of colleagues'] = [df.loc[df.index.overlaps(i), 'NAME'].tolist() for i in df.index]

Output:

                          PERSON ID START DATE   END DATE   NAME  Num of Colleagues  List of colleagues
(2018-01-01, 2019-01-01]          1 2018-01-01 2019-01-01    Joe                  1        [Joe, James]
(1999-01-01, 2019-01-01]          2 1999-01-01 2019-01-01  James                  2  [Joe, James, Jack]
(1980-01-01, 2000-01-01]          3 1980-01-01 2000-01-01   Jack                  1       [James, Jack]
(1967-01-01, 1975-01-01]          4 1967-01-01 1975-01-01    Jim                  0               [Jim]

And, eliminated "self" on that row:

df['List of colleagues'] = [df.loc[df.index.overlaps(i) & (df.index != i), 'NAME'].tolist()  for i in df.index]

Output:

                          PERSON ID START DATE   END DATE   NAME  Num of Colleagues List of colleagues
(2018-01-01, 2019-01-01]          1 2018-01-01 2019-01-01    Joe                  1            [James]
(1999-01-01, 2019-01-01]          2 1999-01-01 2019-01-01  James                  2        [Joe, Jack]
(1980-01-01, 2000-01-01]          3 1980-01-01 2000-01-01   Jack                  1            [James]
(1967-01-01, 1975-01-01]          4 1967-01-01 1975-01-01    Jim                  0                 []

For each row in Dataframe test all other rows for a condition

1 Answers1