
I've got a fairly large data set of about 2 million records, each of which has a start time and an end time. I'd like to insert a field into each record that counts how many records there are in the table where:

  • Start time is less than or equal to "this row"'s start time
  • AND end time is greater than "this row"'s start time

So basically each record ends up with a count of how many events, including itself, are "active" concurrently with it.

I've been trying to teach myself pandas to do this, but I'm not even sure where to start looking. I can find lots of examples of summing rows that meet a fixed condition like "> 2", but I can't seem to grasp how to iterate over rows and conditionally count a column based on values in the current row.

  • I'm confused about what you want. Can you please be more specific? Suppose row_1 has start_time = 4 and end_time = 7, row_2 has start_time = 3 and end_time = 5, row_3 has start_time = 1 and end_time = 3, and row_4 has start_time = 2 and end_time = 8. What output do you want? – exp1orer Jul 14 '14 at 21:40
  • Just realized I misspoke above. It should count events which are still active as of this record's *start* time. So in your example, you would get this: Row_1: Start:4 End:7 Concurrent:3 | Row_2: Start:3 End:5 Concurrent:2 | Row_3: Start:1 End:3 Concurrent:1 | Row_4: Start:2 End:8 Concurrent:2 – user3838505 Jul 14 '14 at 21:51
  • So the question is "how many other rows were active when this entry started"? – exp1orer Jul 14 '14 at 21:55
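To make the expected behavior concrete, here is the example from that comment thread as a DataFrame, with the agreed counts (the column names are my own):

import pandas as pd

# Example rows from the comments; expected_concurrent is the desired new field.
df = pd.DataFrame({'start_time': [4, 3, 1, 2],
                   'end_time':   [7, 5, 3, 8]})
expected_concurrent = [3, 2, 1, 2]  # counts as of each row's start time, including itself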

3 Answers


You can try the code below to get the final result.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[2, 10], [5, 8], [3, 8], [6, 9]]), columns=["start", "end"])

# For each row, count the rows (including itself) that start at or before
# this row's start time and end after it.
active_events = {}
for i in df.index:
    active_events[i] = len(df[(df["start"] <= df.loc[i, "start"]) & (df["end"] > df.loc[i, "start"])])

last_columns = pd.DataFrame({'No. active events': pd.Series(active_events)})
df = df.join(last_columns)
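For the sample frame above, the joined result comes out as follows (assuming the code runs as written):

   start  end  No. active events
0      2   10                  1
1      5    8                  3
2      3    8                  2
3      6    9                  4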
kimal

Here goes. This is going to be SLOW.

Note that this counts each row as overlapping with itself, so the results column will never be 0. (Subtract 1 from the result to do it the other way.)

import pandas as pd

df = pd.DataFrame({'start_time': [4, 3, 1, 2], 'end_time': [7, 5, 3, 8]})
df = df[['start_time', 'end_time']]  # just reordering the columns for aesthetics

def overlaps_with_row(row, frame):
    # Rows that started at or before this row's start time...
    starts_before_mask = frame.start_time <= row.start_time
    # ...and were still running at that moment.
    ends_after_mask = frame.end_time > row.start_time
    return (starts_before_mask & ends_after_mask).sum()

df['number_which_overlap'] = df.apply(overlaps_with_row, frame=df, axis=1)

Yields:

In [8]: df
Out[8]: 
   start_time  end_time  number_which_overlap
0           4         7                     3
1           3         5                     2
2           1         3                     1
3           2         8                     2

[4 rows x 3 columns]
exp1orer
  • Slow as in multiple minutes to run on a dataframe with 2M rows. – exp1orer Jul 14 '14 at 22:30
  • I ended up doing this: `for i in a.index: active[i] = len(a[(a['Start'] <= a.loc[i,'Start']) & (a['End'] > a.loc[i,'Start'])])`. Do you think this is slower, faster, or comparable in speed? – user3838505 Jul 14 '14 at 22:41
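Regarding the speed concerns in these comments: at 2M rows, both the apply version and the loop version scan the whole frame once per row, so they are O(n²). A fully vectorized alternative is possible; here is a sketch of my own reformulation (not code from the thread) using numpy.searchsorted, based on the observation that the count for a start time s is the number of starts <= s minus the number of ends <= s:

import numpy as np
import pandas as pd

df = pd.DataFrame({'start_time': [4, 3, 1, 2], 'end_time': [7, 5, 3, 8]})

# Sorted copies let searchsorted answer "how many values are <= s" in O(log n).
starts = np.sort(df['start_time'].values)
ends = np.sort(df['end_time'].values)

s = df['start_time'].values
n_started = np.searchsorted(starts, s, side='right')  # rows with start <= s
n_ended = np.searchsorted(ends, s, side='right')      # rows with end <= s
df['number_which_overlap'] = n_started - n_ended      # rows still active at s

This gives the same [3, 2, 1, 2] result as the apply version above, in O(n log n) overall.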
import pandas as pd

def counter(s: pd.Series) -> int:
    # Strict '>' on the end time, per the question: events that ended
    # exactly at this row's start time are not counted.
    return ((df["start"] <= s["start"]) & (df["end"] > s["start"])).sum()

df["count"] = df.apply(counter, axis=1)

This feels like a much simpler approach, using the apply method. It doesn't compromise much on speed: although apply is not as fast as vectorized pandas methods like cumsum(), it should be faster than an explicit for loop.
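That claim is easy to check empirically. Below is a minimal timing sketch on synthetic data (the sizes and data are illustrative, not from the thread; timings will vary by machine):

import time
import numpy as np
import pandas as pd

# Synthetic data: 10,000 random intervals.
rng = np.random.default_rng(0)
df = pd.DataFrame({"start": rng.integers(0, 1000, 10_000)})
df["end"] = df["start"] + rng.integers(1, 100, 10_000)

def counter(s: pd.Series) -> int:
    return ((df["start"] <= s["start"]) & (df["end"] > s["start"])).sum()

t0 = time.perf_counter()
df.apply(counter, axis=1)
print("apply:", time.perf_counter() - t0)

t0 = time.perf_counter()
pd.Series([counter(row) for _, row in df.iterrows()])
print("loop: ", time.perf_counter() - t0)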

supercooler8