1

I have a dataframe with two columns that are datetime objects (time_a and time_b). I need to check on a row-by-row basis if the elements of time_a or time_b for such row, are contained within any of the other intervals defined by the other time_a and time_b rows. That's what I defined as 'overlap', if any period of work between time_a or time_b clashes partially with other intervals regardless of the room.

The way I managed to approach this initially was to create tuples with the data of time_a and time_b, and then checking on a row-by-row basis if time_a or time_b fell within any the range of any of these tuples.

That approach seemed convoluted, so I wanted to explore the power of Pandas for such purpose. Using this great question as example, I tried adapting it to my problem, using a dataframe named test_2 (columns are date, room, time_a, time_b, personnel_number) whilst test_3 only has time_a, time_b columns. I wrote my partial solution like this:

any_in_range = lambda row, iterable: any(
    [(x > row[2]) & (x < row[3]) for x in iterable])
test_2['label_1'] = test_2.apply(any_in_range, iterable=test_3['time_case_finished'], axis=1)
test_2['label_2'] = test_2.apply(any_in_range, iterable=test_3['time_finished_cleaning'], axis=1)
test_2['isOverlap'] = np.where((test_2['label_1'] == True) | (test_2['label_2'] == True), 1, 0)
final_overlap = test_2[test_2['isOverlap'] == 1]

And a sample of the outcome, is described below:

    date    room    time_a  time_b  personnel_number    label_1 label_2 isOverlap
77  2021-09-14  3   2021-09-14 12:01:42-07:00   2021-09-14 12:12:20-07:00   1   False   False   0
80  2021-09-14  1   2021-09-14 13:15:36-07:00   2021-09-14 13:24:50-07:00   1   False   False   0
83  2021-09-14  1   2021-09-14 14:21:52-07:00   2021-09-14 14:39:37-07:00   1   True    False   1
84  2021-09-14  3   2021-09-14 14:38:58-07:00   2021-09-14 14:52:24-07:00   1   True    True    1
90  2021-09-15  4   2021-09-15 09:25:11-07:00   2021-09-15 09:53:33-07:00   1   True    True    1
91  2021-09-15  5   2021-09-15 09:28:30-07:00   2021-09-15 09:42:25-07:00   1   False   False   0
92  2021-09-15  1   2021-09-15 09:52:18-07:00   2021-09-15 10:07:25-07:00   1   True    True    1
93  2021-09-15  3   2021-09-15 10:02:05-07:00   2021-09-15 10:20:13-07:00   1   False   True    1

Now, notice how row 90 is marked as 1, but my code fails to find the other row in which it is supposed to be overlapping (which should be row 91, marking a 0). The overlap is not total, even if just a minute, I still want to count it in as overlap, but my code is not fulfilling the purpose for every case in my dataset.

Any help or advice is dearly appreciated.

Aquiles Páez
  • 503
  • 6
  • 18
  • kindly provide a sample dataframe, possibly ten rows or thereabout as a dictionary, so it is reproducible ``df.to_dict('list')`` – sammywemmy Apr 23 '22 at 23:26
  • Are 'time_a' and 'time_b' allowed to overlap if they are on the same line? – Nev1111 Apr 23 '22 at 23:35
  • @sammywemmy hi, the original dataframe is the same as the sample outcome, but only has the first 4 columns (that is .loc[:, 0:4]), the outcome only adds information to it. – Aquiles Páez Apr 24 '22 at 22:26
  • @Nev1111 hmmm, you mean, using logic conditionals for closed intervals such as '>=' or '<='? If I understood you correctly, that will mark every row as an overlap. Unless you mean something else. – Aquiles Páez Apr 24 '22 at 22:28

1 Answers1

5

The problem seems to boil down to finding overlapping intervals, where the intervals are defined by time_a and time_b

This can be efficiently solved with the piso (pandas interval set operations) package, in particular the adjacency_matrix method

import pandas as pd
import piso

ii = pd.IntervalIndex.from_arrays(df["time_a"], df["time_b"])
df["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values

note: I am the creator of piso. Please feel free to reach out with feedback or questions if you have any.

Riley
  • 2,153
  • 1
  • 6
  • 16
  • This was so elegant and efficient, obviously my solution with the tuples was not going to be efficient, specially iterating over the entire search space for each and single row. Also, I didn't knew about this piso library, first time I heard of it, and now I'll research it thanks to you. Would give you more points if I could for this answer, thank you! – Aquiles Páez Apr 24 '22 at 22:23
  • @Riley I have a similar task however for two overlapping instances I would like to choose the smaller start time and larger end time. Can this also be achieved using piso? – Marc Dec 07 '22 at 14:58
  • 1
    @Marc perhaps [piso.union](https://piso.readthedocs.io/en/latest/reference/api/piso.union.html) will solve your problem? – Riley Dec 08 '22 at 11:42
  • 1
    @Marc if you post it as a question then let me know – Riley Dec 08 '22 at 11:43